Egyptian Dialect Text Classification by Fine-tuning AraBERT 🇪🇬

This model is a fine-tuned version of aubmindlab/bert-base-arabertv2, trained specifically on Egyptian Arabic text for hate speech and offensive language classification. It achieves 92% accuracy.

It classifies text into 6 categories:

| Category | Label |
|---|---|
| Offensive | Offensive text |
| Neutral | Neutral text |
| Racism | Racism |
| Sexism | Sexism |
| Religious Discrimination | Religious discrimination |
| Ads | Advertisements |
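
The repository also ships a labels.json file (see the usage section below) that maps class indices to these names. Its exact contents and ordering come from the file itself; assuming a flat JSON array in class-index order, it would look something like:

```json
["Offensive", "Neutral", "Racism", "Sexism", "Religious Discrimination", "Ads"]
```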

Training Details

  • Base Model: AraBERT v2 (aubmindlab/bert-base-arabertv2)
  • Dataset: Egyptian Arabic hate speech dataset with 6 labeled categories
  • Epochs: up to 30, with early stopping after 5 epochs without improvement
  • Loss: cross-entropy with label smoothing (factor 0.1) for better generalization; see the sketch after this list
  • Best Model Selection: Based on weighted average F1-score
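
The full training script is not published with this card; the sketch below shows one way these settings map onto the standard Hugging Face Trainer API. The dataset objects (train_dataset, val_dataset) are placeholders for your own tokenized splits, and any hyperparameter not listed above (batch size, learning rate, etc.) is left at its default.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

base_model = "aubmindlab/bert-base-arabertv2"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=6)

def compute_metrics(eval_pred):
    # Weighted-average F1 -- the metric used to pick the best checkpoint
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="arabert-egyptian-cls",
    num_train_epochs=30,             # upper bound; early stopping usually ends sooner
    eval_strategy="epoch",           # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    label_smoothing_factor=0.1,      # label smoothing applied inside the Trainer's loss
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,     # placeholder: your tokenized training split
    eval_dataset=val_dataset,        # placeholder: your tokenized validation split
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```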

Performance Metrics

Training was performed on a CUDA GPU. Below is a snapshot of model performance across the first 15 epochs:

| Epoch | Train Loss | Val Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 1.6891 | 1.5227 | 0.4980 | 0.5027 | 0.4980 | 0.4741 |
| 2 | 1.1953 | 0.9491 | 0.7784 | 0.7883 | 0.7784 | 0.7758 |
| 3 | 0.7612 | 0.7010 | 0.8670 | 0.8693 | 0.8670 | 0.8673 |
| 4 | 0.6265 | 0.6363 | 0.9035 | 0.9042 | 0.9035 | 0.9031 |
| 5 | 0.5505 | 0.6547 | 0.8996 | 0.8995 | 0.8996 | 0.8990 |
| 6 | 0.5119 | 0.6861 | 0.8931 | 0.9018 | 0.8931 | 0.8947 |
| 7 | 0.4779 | 0.6675 | 0.9048 | 0.9066 | 0.9048 | 0.9052 |
| 8 | 0.4673 | 0.6353 | 0.9218 | 0.9238 | 0.9218 | 0.9222 |
| 9 | 0.4542 | 0.6614 | 0.9126 | 0.9136 | 0.9126 | 0.9125 |
| 10 | 0.4444 | 0.6618 | 0.9231 | 0.9238 | 0.9231 | 0.9233 |
| 11 | 0.4359 | 0.6689 | 0.9231 | 0.9235 | 0.9231 | 0.9230 |
| 12 | 0.4344 | 0.7120 | 0.9061 | 0.9097 | 0.9061 | 0.9067 |
| 13 | 0.4325 | 0.7248 | 0.9061 | 0.9105 | 0.9061 | 0.9068 |
| 14 | 0.4369 | 0.6946 | 0.9179 | 0.9221 | 0.9179 | 0.9189 |
| 15 | 0.4289 | 0.6864 | 0.9153 | 0.9171 | 0.9153 | 0.9157 |

Using the Model Without a Token (For End Users)

Since the model Woolv7007/egyptian-text-classification and its labels.json file are publicly available, you can load the model, tokenizer, and labels directly, with no Hugging Face token or special setup required.

How to load and use the model together with its labels:

  • Download the labels.json file directly from the Hugging Face Hub.
  • Load the model and tokenizer without a token.
  • Run prediction on your texts and map the predicted class index to its label from the labels list.

You can fetch labels.json by URL like this:

```python
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()
```

Example Python code:

```python
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model name on Hugging Face Hub
model_name = "Woolv7007/egyptian-text-classification"

# Load labels.json from the public repository without a token
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()

# Load model and tokenizer without a token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Simple prediction function that returns the predicted label
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
        pred_id = torch.argmax(outputs.logits, dim=1).item()
    return labels[pred_id]

# Verbose prediction with probability scores for each label (optional)
def predict_verbose(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=1).squeeze().tolist()
    for label, prob in zip(labels, probs):
        print(f"{label}: {prob:.2%}")
    return labels[torch.argmax(logits).item()]

# Example usage
text = "ุทุฒ ููŠ ุงูŠ ุญุฏ ู…ุด ุนุงุฌุจู‡ ุดุบู„ูŠ"
print("Text:", text)
print("Predicted label:", predict(text))

# To see detailed output with probabilities, uncomment:
# print("Prediction with probabilities:")
# predict_verbose(text)
```
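
For scoring many texts at once, a batched variant that moves the model to a GPU when one is available may be more convenient. This is a minimal sketch reusing the model, tokenizer, and labels loaded above; the batch size is an arbitrary choice:

```python
# Batched inference; reuses `model`, `tokenizer`, and `labels` from the example above
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def predict_batch(texts, batch_size=32):
    predictions = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True, max_length=256).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits
        predictions.extend(labels[p] for p in logits.argmax(dim=1).tolist())
    return predictions
```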