Ethical-Content-Moderation

Fine-Tuning DistilBERT for Ethical Content Moderation

Model description

This model fine-tunes distilbert-base-uncased on the Davidson et al. (2017) hate speech and offensive language dataset, loaded from the Hugging Face Hub. Using a frozen DistilBERT base and a custom dense classification head, it predicts whether a tweet is:

  • (a) hate speech
  • (b) offensive but not hate
  • (c) neither

The classification head consists of three dense layers (256 → 128 → 32) with LeakyReLU and Swish activations, plus dropout and batch normalization to improve generalization.
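Below is a minimal sketch of how a frozen-base model with such a head could be assembled in Keras; the layer ordering, dropout rates, pooling choice, and sequence length are assumptions rather than the exact training code.

```python
import tensorflow as tf
from transformers import TFDistilBertModel

MAX_LEN = 128      # assumed maximum sequence length
NUM_CLASSES = 3    # hate speech / offensive / neither

# Frozen DistilBERT encoder: transformer weights are not updated during training
base = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
base.trainable = False

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Use the [CLS] token representation as a fixed-size sentence embedding
hidden = base(input_ids, attention_mask=attention_mask).last_hidden_state
cls = hidden[:, 0, :]

# Dense head: 256 -> 128 -> 32 with LeakyReLU / Swish, batch norm and dropout
x = tf.keras.layers.Dense(256)(cls)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.LeakyReLU()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(128, activation="swish")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

x = tf.keras.layers.Dense(32, activation="swish")(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
```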

Intended uses & limitations

Intended uses

  • As a starting point for transfer learning in NLP and AI ethics projects

  • Academic research on hate speech and offensive language detection

  • As a fast, lightweight screening tool for moderating user-generated content (e.g., tweets, comments, reviews)
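For the screening use case, a minimal inference sketch follows (it assumes the model built in the sketch above or loaded from disk; the label order and the `screen` helper name are illustrative):

```python
import numpy as np
from transformers import AutoTokenizer

# Class order assumed to follow the dataset's label ids (0, 1, 2)
LABELS = ["hate speech", "offensive but not hate", "neither"]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def screen(texts, model, max_len=128):
    """Return a (label, confidence) pair for each input text."""
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="tf")
    probs = model.predict([enc["input_ids"], enc["attention_mask"]], verbose=0)
    preds = np.argmax(probs, axis=-1)
    return [(LABELS[p], float(probs[i, p])) for i, p in enumerate(preds)]

# Example (assuming `model` was built as in the sketch above or loaded from disk):
# print(screen(["example tweet to check"], model))
```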

Limitations

  • Not suitable for real-time production use without further robustness testing

  • Trained on English Twitter data (2017); performance on other domains or languages may be poor

  • Does not guarantee removal of all forms of bias or unfairness; see the Fairness & Bias section

Training and evaluation data

Dataset: Davidson et al., 2017 (24K+ English tweets, labeled as hate, offensive, or neither)

Class distribution: Imbalanced (majority: “offensive”; minority: “hate”)

Split: 80% training, 20% validation (stratified)
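A sketch of loading the dataset and producing the stratified 80/20 split (the `tweet` and `class` column names follow the Hub dataset card; the tokenization settings are assumptions):

```python
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Davidson et al. (2017) dataset on the Hugging Face Hub
ds = load_dataset("tdavidson/hate_speech_offensive", split="train")
texts = list(ds["tweet"])    # raw tweet text
labels = list(ds["class"])   # 0 = hate speech, 1 = offensive, 2 = neither

# Stratified 80/20 split preserves the imbalanced class distribution
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Tokenize for DistilBERT (max length is an assumption)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_enc = tokenizer(train_texts, padding="max_length", truncation=True,
                      max_length=128, return_tensors="tf")
val_enc = tokenizer(val_texts, padding="max_length", truncation=True,
                    max_length=128, return_tensors="tf")
```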

Training procedure

Frozen base: DistilBERT transformer weights frozen; only dense classifier head is trained.

Loss: Sparse categorical crossentropy

Optimizer: Adam (learning rate = 3e-5)

Batch size: 16

Class weighting: Used to compensate for class imbalance (higher weight for “hate”)

Early stopping: Custom callback at val_accuracy ≥ 0.92

Hardware: Google Colab (Tesla T4 GPU)
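A sketch of the class-weighting, custom early-stopping callback, and compile/fit settings described above, reconstructed from this description rather than copied from the training script (it reuses `model` and the tokenized splits from the earlier sketches):

```python
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

# Inverse-frequency weights give the minority "hate" class more influence
weights = compute_class_weight("balanced", classes=np.unique(train_labels),
                               y=np.array(train_labels))
class_weight = dict(enumerate(weights))

class StopAtValAccuracy(tf.keras.callbacks.Callback):
    """Stop training once validation accuracy reaches a target threshold."""
    def __init__(self, threshold=0.92):
        super().__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("val_accuracy", 0.0) >= self.threshold:
            self.model.stop_training = True

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(
    [train_enc["input_ids"], train_enc["attention_mask"]], np.array(train_labels),
    validation_data=([val_enc["input_ids"], val_enc["attention_mask"]],
                     np.array(val_labels)),
    epochs=15,
    batch_size=16,
    class_weight=class_weight,
    callbacks=[StopAtValAccuracy(threshold=0.92)],
)
```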

Training hyperparameters

The following hyperparameters were used during training:

  • optimizer: {'name': 'Adam', 'weight_decay': None, 'clipnorm': None, 'global_clipnorm': None, 'clipvalue': None, 'use_ema': False, 'ema_momentum': 0.99, 'ema_overwrite_frequency': None, 'jit_compile': True, 'is_legacy_optimizer': False, 'learning_rate': 3e-05, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False}
  • training_precision: float32

Training results

| Train Loss | Train Accuracy | Validation Loss | Validation Accuracy | Epoch |
|:----------:|:--------------:|:---------------:|:-------------------:|:-----:|
| 1.4634     | 0.4236         | 0.9268          | 0.6454              | 1     |
| 1.1659     | 0.5067         | 0.9578          | 0.6480              | 2     |
| 1.0965     | 0.5388         | 0.8224          | 0.7043              | 3     |
| 1.0026     | 0.5667         | 0.8131          | 0.7051              | 4     |
| 0.9948     | 0.5817         | 0.8264          | 0.6940              | 5     |
| 0.9631     | 0.5921         | 0.7893          | 0.7111              | 6     |
| 0.9431     | 0.6009         | 0.7725          | 0.7252              | 7     |
| 0.9019     | 0.6197         | 0.8177          | 0.7049              | 8     |
| 0.8790     | 0.6247         | 0.7408          | 0.7351              | 9     |
| 0.8578     | 0.6309         | 0.7786          | 0.7176              | 10    |
| 0.8275     | 0.6455         | 0.7387          | 0.7331              | 11    |
| 0.8530     | 0.6411         | 0.7253          | 0.7273              | 12    |
| 0.8197     | 0.6506         | 0.7430          | 0.7293              | 13    |
| 0.8145     | 0.6549         | 0.7535          | 0.7162              | 14    |
| 0.8081     | 0.6631         | 0.7207          | 0.7402              | 15    |

Best validation accuracy: 0.7402 (epoch 15)

Environmental Impact

Training emissions: Estimated at 0.0273 kg CO₂ (CodeCarbon, Colab T4 GPU)
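Emissions were estimated with CodeCarbon; a minimal sketch of wrapping a training run with its tracker (the project name is illustrative):

```python
from codecarbon import EmissionsTracker

# Wrap the training run with a CodeCarbon tracker
tracker = EmissionsTracker(project_name="distilbert-hatespeech")
tracker.start()
try:
    # ... run the model.fit(...) call from the training sketch above ...
    pass
finally:
    emissions_kg = tracker.stop()  # estimated kg CO2-eq for the tracked run
    print(f"Estimated training emissions: {emissions_kg:.4f} kg CO2")
```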

Fairness & Bias

Bias/fairness audit: The model was evaluated on synthetic gender pronoun tests and showed relatively balanced outputs, but biases may remain due to dataset limitations. See Appendix B of the project report for details.
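A toy illustration of a synthetic pronoun-swap check of this kind (the templates are illustrative and reuse the hypothetical `screen` helper from the inference sketch above; this is not the audit from Appendix B):

```python
# Templates differing only in a gendered pronoun; large gaps between the two
# variants' predictions would suggest gender-related bias.
templates = [
    "{} is a terrible driver",
    "{} should not be allowed to vote",
    "I really admire {}",
]

for template in templates:
    for pronoun in ("he", "she"):
        text = template.format(pronoun)
        label, confidence = screen([text], model)[0]
        print(f"{text!r:<45} -> {label} ({confidence:.2f})")
```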

Citation

If you use this model, please cite:

Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. ICWSM 2017.

Radiyeh, W. (2025). DistilBERT Hate Speech Classifier. https://huggingface.co/will-rads/distilbert-hatespeech-classifier

Framework versions

  • Transformers 4.51.3
  • TensorFlow 2.18.0
  • Datasets 3.6.0
  • Tokenizers 0.21.1

Evaluation results

  • Validation Accuracy on tdavidson/hate_speech_offensive (self-reported): 0.714
  • Validation Loss on tdavidson/hate_speech_offensive (self-reported): 0.734