---
language: en
license: mit
library_name: transformers
tags:
- text-classification
- hate-speech
- offensive-language
- distilbert
- tensorflow
pipeline_tag: text-classification
widget:
- text: "I love this beautiful day, it's fantastic!"
  example_title: "Positive Example"
- text: "You are a terrible person and I wish you the worst."
  example_title: "Offensive Example"
- text: "This is a completely neutral statement about clouds."
  example_title: "Neutral Example"
- text: "Kill all of them, they don't belong in our country."
  example_title: "Hate Speech Example"
model-index:
- name: distilbert-hatespeech-classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: tdavidson/hate_speech_offensive
      type: tdavidson/hate_speech_offensive
    metrics:
    - name: Validation Accuracy
      type: accuracy
      value: 0.7402
    - name: Validation Loss
      type: loss
      value: 0.7207
---

# Ethical-Content-Moderation

Fine-Tuning DistilBERT for Ethical Content Moderation

## Model description

This model fine-tunes `distilbert-base-uncased` on the Davidson et al. (2017) hate speech and offensive language dataset, loaded from the Hugging Face Hub. The classifier predicts whether a tweet is:

- (a) hate speech
- (b) offensive but not hate
- (c) neither

The DistilBERT base is frozen and a custom dense head is trained on top of it. The head consists of three dense layers (256 → 128 → 32) with LeakyReLU and Swish activations, plus dropout and batch normalization to improve generalization.

## Intended uses & limitations

**Intended uses**

- A starting point for transfer learning in NLP and AI ethics projects
- Academic research on hate speech and offensive language detection
- A fast, lightweight screening tool for moderating user-generated content (e.g., tweets, comments, reviews)

**Limitations**

- Not suitable for real-time production use without further robustness testing
- Trained on English Twitter data (2017); performance on other domains or languages may be poor
- Does not guarantee removal of all forms of bias or unfairness; see the Fairness & Bias section

## Training and evaluation data

- Dataset: Davidson et al., 2017 (24K+ English tweets, labeled as hate, offensive, or neither)
- Class distribution: imbalanced (majority: “offensive”; minority: “hate”)
- Split: 80% training, 20% validation (stratified)

## Training procedure

- Frozen base: DistilBERT transformer weights are frozen; only the dense classifier head is trained (a hedged sketch of the full setup follows this list)
- Loss: sparse categorical crossentropy
- Optimizer: Adam (learning rate = 3e-5)
- Batch size: 16
- Class weighting: used to compensate for class imbalance (higher weight for “hate”)
- Early stopping: custom callback that stops training once val_accuracy ≥ 0.92
- Hardware: Google Colab (Tesla T4 GPU)
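Below is a minimal Keras sketch of this setup, written from the description above rather than taken from the project code. The sequence length, dropout rates, placement of the normalization layers, the use of the [CLS] token as the pooled representation, and the exact class-weight values are assumptions.

```python
import tensorflow as tf
from transformers import TFDistilBertModel

MAX_LEN = 128  # assumption: tokenizer sequence length

# Frozen DistilBERT encoder: only the dense head below is trained.
encoder = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
encoder.trainable = False

# Inputs as produced by the DistilBERT tokenizer.
input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Use the [CLS] token representation as the sentence embedding (assumption).
hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state
cls = hidden[:, 0, :]

# Dense head: 256 -> 128 -> 32 with LeakyReLU / Swish, batch norm and dropout.
x = tf.keras.layers.Dense(256)(cls)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.LeakyReLU()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(128, activation="swish")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(32, activation="swish")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # hate / offensive / neither

model = tf.keras.Model([input_ids, attention_mask], outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Custom early stopping: halt once validation accuracy reaches 0.92.
class AccuracyThresholdStop(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("val_accuracy", 0.0) >= 0.92:
            self.model.stop_training = True

# Class weights compensate for the imbalanced labels (values are illustrative).
class_weight = {0: 5.0, 1: 1.0, 2: 2.0}

# model.fit(train_inputs, train_labels,
#           validation_data=(val_inputs, val_labels),
#           epochs=15, batch_size=16,
#           class_weight=class_weight,
#           callbacks=[AccuracyThresholdStop()])
```

The model construction and `compile` call run as written; the `fit` call is left commented out because it depends on tokenized tensors (`train_inputs`, `val_inputs`) prepared elsewhere.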
### Training hyperparameters

The following hyperparameters were used during training:

- optimizer: {'name': 'Adam', 'weight_decay': None, 'clipnorm': None, 'global_clipnorm': None, 'clipvalue': None, 'use_ema': False, 'ema_momentum': 0.99, 'ema_overwrite_frequency': None, 'jit_compile': True, 'is_legacy_optimizer': False, 'learning_rate': 3e-05, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False}
- training_precision: float32

### Training results

| Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | Epoch |
|:----------:|:--------------:|:---------------:|:-------------------:|:-----:|
| 1.4634     | 0.4236         | 0.9268          | 0.6454              | 1     |
| 1.1659     | 0.5067         | 0.9578          | 0.6480              | 2     |
| 1.0965     | 0.5388         | 0.8224          | 0.7043              | 3     |
| 1.0026     | 0.5667         | 0.8131          | 0.7051              | 4     |
| 0.9948     | 0.5817         | 0.8264          | 0.6940              | 5     |
| 0.9631     | 0.5921         | 0.7893          | 0.7111              | 6     |
| 0.9431     | 0.6009         | 0.7725          | 0.7252              | 7     |
| 0.9019     | 0.6197         | 0.8177          | 0.7049              | 8     |
| 0.8790     | 0.6247         | 0.7408          | 0.7351              | 9     |
| 0.8578     | 0.6309         | 0.7786          | 0.7176              | 10    |
| 0.8275     | 0.6455         | 0.7387          | 0.7331              | 11    |
| 0.8530     | 0.6411         | 0.7253          | 0.7273              | 12    |
| 0.8197     | 0.6506         | 0.7430          | 0.7293              | 13    |
| 0.8145     | 0.6549         | 0.7535          | 0.7162              | 14    |
| 0.8081     | 0.6631         | 0.7207          | 0.7402              | 15    |

Best validation accuracy: **0.7402** at epoch 15.

### Environmental Impact

Training emissions: estimated at 0.0273 kg CO₂ (measured with CodeCarbon on a Colab T4 GPU; a minimal measurement sketch appears at the end of this card).

### Fairness & Bias

The model was evaluated on synthetic gender pronoun tests and showed relatively balanced outputs, but biases may remain due to dataset limitations. See Appendix B of the project report for details.

### Citation

If you use this model, please cite:

- Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. ICWSM 2017.
- Radiyeh, W. (2025). DistilBERT Hate Speech Classifier. https://huggingface.co/will-rads/distilbert-hatespeech-classifier

### Framework versions

- Transformers 4.51.3
- TensorFlow 2.18.0
- Datasets 3.6.0
- Tokenizers 0.21.1
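For completeness, the sketch below shows how an emissions figure like the one reported under Environmental Impact can be obtained with CodeCarbon. Wrapping `model.fit` in a tracker this way is an assumption about the measurement setup, not the project's exact code.

```python
# Hedged sketch: estimate training emissions with CodeCarbon.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="distilbert-hatespeech-classifier")
tracker.start()
# ... model.fit(...) would run here ...
emissions_kg = tracker.stop()  # estimated kg CO2-eq for the tracked block
print(f"Estimated training emissions: {emissions_kg:.4f} kg CO2-eq")
```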