---
language: en
license: mit
library_name: transformers
tags:
- text-classification
- hate-speech
- offensive-language
- distilbert
- tensorflow
pipeline_tag: text-classification
widget:
- text: "I love this beautiful day, it's fantastic!"
  example_title: "Positive Example"
- text: "You are a terrible person and I wish you the worst."
  example_title: "Offensive Example"
- text: "This is a completely neutral statement about clouds."
  example_title: "Neutral Example"
- text: "Kill all of them, they don't belong in our country."
  example_title: "Hate Speech Example"
model-index:
- name: distilbert-hatespeech-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: tdavidson/hate_speech_offensive
      type: tdavidson/hate_speech_offensive
    metrics:
    - name: Validation Accuracy
      type: accuracy
      value: 0.7402
    - name: Validation Loss
      type: loss
      value: 0.7207
---

# Ethical-Content-Moderation

Fine-Tuning DistilBERT for Ethical Content Moderation

## Model description

This model fine-tunes `distilbert-base-uncased` on the Davidson et al. (2017) hate speech and offensive language dataset, loaded from the Hugging Face Hub. The classifier predicts whether a tweet is:

- (a) hate speech
- (b) offensive but not hate
- (c) neither

The DistilBERT base is frozen and a custom dense head is trained on top of it. The head consists of three dense layers (256 → 128 → 32) with LeakyReLU and Swish activations, plus dropout and batch normalization to improve generalization.

## Intended uses & limitations

**Intended uses**

- A starting point for transfer learning in NLP and AI ethics projects
- Academic research on hate speech and offensive language detection
- A fast, lightweight screening tool for moderating user-generated content (e.g., tweets, comments, reviews)

**Limitations**

- Not suitable for real-time production use without further robustness testing
- Trained on English Twitter data (2017); performance on other domains or languages may be poor
- Does not guarantee removal of all forms of bias or unfairness; see the Fairness & Bias section

## Training and evaluation data

- Dataset: Davidson et al., 2017 (24K+ English tweets, labeled as hate, offensive, or neither)
- Class distribution: imbalanced (majority: “offensive”; minority: “hate”)
- Split: 80% training, 20% validation (stratified)

## Training procedure

- Frozen base: DistilBERT transformer weights are frozen; only the dense classifier head is trained (a hedged sketch of the full setup follows this list)
- Loss: sparse categorical crossentropy
- Optimizer: Adam (learning rate = 3e-5)
- Batch size: 16
- Class weighting: used to compensate for class imbalance (higher weight for “hate”)
- Early stopping: custom callback that stops training once val_accuracy ≥ 0.92
- Hardware: Google Colab (Tesla T4 GPU)
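Below is a minimal Keras sketch of this setup, written from the description above rather than taken from the project code. The sequence length, dropout rates, placement of the normalization layers, the use of the [CLS] token as the pooled representation, and the exact class-weight values are assumptions.

```python
import tensorflow as tf
from transformers import TFDistilBertModel

MAX_LEN = 128  # assumption: tokenizer sequence length

# Frozen DistilBERT encoder: only the dense head below is trained.
encoder = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
encoder.trainable = False

# Inputs as produced by the DistilBERT tokenizer.
input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Use the [CLS] token representation as the sentence embedding (assumption).
hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state
cls = hidden[:, 0, :]

# Dense head: 256 -> 128 -> 32 with LeakyReLU / Swish, batch norm and dropout.
x = tf.keras.layers.Dense(256)(cls)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.LeakyReLU()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(128, activation="swish")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(32, activation="swish")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # hate / offensive / neither

model = tf.keras.Model([input_ids, attention_mask], outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Custom early stopping: halt once validation accuracy reaches 0.92.
class AccuracyThresholdStop(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("val_accuracy", 0.0) >= 0.92:
            self.model.stop_training = True

# Class weights compensate for the imbalanced labels (values are illustrative).
class_weight = {0: 5.0, 1: 1.0, 2: 2.0}

# model.fit(train_inputs, train_labels,
#           validation_data=(val_inputs, val_labels),
#           epochs=15, batch_size=16,
#           class_weight=class_weight,
#           callbacks=[AccuracyThresholdStop()])
```

The model construction and `compile` call run as written; the `fit` call is left commented out because it depends on tokenized tensors (`train_inputs`, `val_inputs`) prepared elsewhere.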
### Training hyperparameters

The following hyperparameters were used during training:

- optimizer: {'name': 'Adam', 'weight_decay': None, 'clipnorm': None, 'global_clipnorm': None, 'clipvalue': None, 'use_ema': False, 'ema_momentum': 0.99, 'ema_overwrite_frequency': None, 'jit_compile': True, 'is_legacy_optimizer': False, 'learning_rate': 3e-05, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False}
- training_precision: float32

### Training results

| Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | Epoch |
|:----------:|:--------------:|:---------------:|:-------------------:|:-----:|
| 1.4634     | 0.4236         | 0.9268          | 0.6454              | 1     |
| 1.1659     | 0.5067         | 0.9578          | 0.6480              | 2     |
| 1.0965     | 0.5388         | 0.8224          | 0.7043              | 3     |
| 1.0026     | 0.5667         | 0.8131          | 0.7051              | 4     |
| 0.9948     | 0.5817         | 0.8264          | 0.6940              | 5     |
| 0.9631     | 0.5921         | 0.7893          | 0.7111              | 6     |
| 0.9431     | 0.6009         | 0.7725          | 0.7252              | 7     |
| 0.9019     | 0.6197         | 0.8177          | 0.7049              | 8     |
| 0.8790     | 0.6247         | 0.7408          | 0.7351              | 9     |
| 0.8578     | 0.6309         | 0.7786          | 0.7176              | 10    |
| 0.8275     | 0.6455         | 0.7387          | 0.7331              | 11    |
| 0.8530     | 0.6411         | 0.7253          | 0.7273              | 12    |
| 0.8197     | 0.6506         | 0.7430          | 0.7293              | 13    |
| 0.8145     | 0.6549         | 0.7535          | 0.7162              | 14    |
| 0.8081     | 0.6631         | 0.7207          | 0.7402              | 15    |

Best validation accuracy: **0.7402** at epoch 15.

### Environmental Impact

Training emissions: estimated at 0.0273 kg CO₂ (measured with CodeCarbon on a Colab T4 GPU; a minimal measurement sketch appears at the end of this card).

### Fairness & Bias

The model was evaluated on synthetic gender pronoun tests and showed relatively balanced outputs, but biases may remain due to dataset limitations. See Appendix B of the project report for details.

### Citation

If you use this model, please cite:

- Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. ICWSM 2017.
- Radiyeh, W. (2025). DistilBERT Hate Speech Classifier. https://huggingface.co/will-rads/distilbert-hatespeech-classifier

### Framework versions

- Transformers 4.51.3
- TensorFlow 2.18.0
- Datasets 3.6.0
- Tokenizers 0.21.1
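For completeness, the sketch below shows how an emissions figure like the one reported under Environmental Impact can be obtained with CodeCarbon. Wrapping `model.fit` in a tracker this way is an assumption about the measurement setup, not the project's exact code.

```python
# Hedged sketch: estimate training emissions with CodeCarbon.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="distilbert-hatespeech-classifier")
tracker.start()
# ... model.fit(...) would run here ...
emissions_kg = tracker.stop()  # estimated kg CO2-eq for the tracked block
print(f"Estimated training emissions: {emissions_kg:.4f} kg CO2-eq")
```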