Egyptian Dialect Text Classification by Fine-tuning AraBERT 🇪🇬

This model is a fine-tuned version of aubmindlab/bert-base-arabertv2, trained specifically on Egyptian Arabic text for hate speech and offensive language classification. It achieves 92% accuracy.

It classifies text into 6 categories:

| Category | Label |
|---|---|
| Offensive | Offensive text |
| Neutral | Neutral text |
| Racism | Racism |
| Sexism | Sexism |
| Religious Discrimination | Religious discrimination |
| Ads | Advertisements |
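
The repository also ships a labels.json file (see the usage section below) that maps class indices to these names. Its exact contents and ordering come from the file itself; assuming a flat JSON array in class-index order, it would look something like:

```json
["Offensive", "Neutral", "Racism", "Sexism", "Religious Discrimination", "Ads"]
```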

Training Details

  • Base Model: AraBERT v2 (aubmindlab/bert-base-arabertv2)
  • Dataset: Egyptian Arabic hate speech dataset with 6 labeled categories
  • Epochs: up to 30, with early stopping after 5 epochs without improvement
  • Loss: cross-entropy with label smoothing (factor 0.1) for better generalization; see the sketch after this list
  • Best Model Selection: Based on weighted average F1-score
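
The full training script is not published with this card; the sketch below shows one way these settings map onto the standard Hugging Face Trainer API. The dataset objects (train_dataset, val_dataset) are placeholders for your own tokenized splits, and any hyperparameter not listed above (batch size, learning rate, etc.) is left at its default.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

base_model = "aubmindlab/bert-base-arabertv2"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=6)

def compute_metrics(eval_pred):
    # Weighted-average F1 -- the metric used to pick the best checkpoint
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="arabert-egyptian-cls",
    num_train_epochs=30,             # upper bound; early stopping usually ends sooner
    eval_strategy="epoch",           # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    label_smoothing_factor=0.1,      # label smoothing applied inside the Trainer's loss
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,     # placeholder: your tokenized training split
    eval_dataset=val_dataset,        # placeholder: your tokenized validation split
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```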

Performance Metrics

Training was performed on a CUDA GPU. Below is a snapshot of model performance across the first 15 epochs:

| Epoch | Train Loss | Val Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 1.6891 | 1.5227 | 0.4980 | 0.5027 | 0.4980 | 0.4741 |
| 2 | 1.1953 | 0.9491 | 0.7784 | 0.7883 | 0.7784 | 0.7758 |
| 3 | 0.7612 | 0.7010 | 0.8670 | 0.8693 | 0.8670 | 0.8673 |
| 4 | 0.6265 | 0.6363 | 0.9035 | 0.9042 | 0.9035 | 0.9031 |
| 5 | 0.5505 | 0.6547 | 0.8996 | 0.8995 | 0.8996 | 0.8990 |
| 6 | 0.5119 | 0.6861 | 0.8931 | 0.9018 | 0.8931 | 0.8947 |
| 7 | 0.4779 | 0.6675 | 0.9048 | 0.9066 | 0.9048 | 0.9052 |
| 8 | 0.4673 | 0.6353 | 0.9218 | 0.9238 | 0.9218 | 0.9222 |
| 9 | 0.4542 | 0.6614 | 0.9126 | 0.9136 | 0.9126 | 0.9125 |
| 10 | 0.4444 | 0.6618 | 0.9231 | 0.9238 | 0.9231 | 0.9233 |
| 11 | 0.4359 | 0.6689 | 0.9231 | 0.9235 | 0.9231 | 0.9230 |
| 12 | 0.4344 | 0.7120 | 0.9061 | 0.9097 | 0.9061 | 0.9067 |
| 13 | 0.4325 | 0.7248 | 0.9061 | 0.9105 | 0.9061 | 0.9068 |
| 14 | 0.4369 | 0.6946 | 0.9179 | 0.9221 | 0.9179 | 0.9189 |
| 15 | 0.4289 | 0.6864 | 0.9153 | 0.9171 | 0.9153 | 0.9157 |

Using the Model Without a Token (For End Users)

Since the model Woolv7007/egyptian-text-classification and its labels.json file are publicly available, you can load the model, tokenizer, and labels directly, with no Hugging Face token or special setup required.

How to load and use the model together with its labels:

  • Download the labels.json file directly from the Hugging Face Hub.
  • Load the model and tokenizer without a token.
  • Run prediction on your texts and map the predicted class index to its label from the labels list.

You can fetch labels.json by URL like this:

```python
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()
```

Example Python code:

```python
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model name on Hugging Face Hub
model_name = "Woolv7007/egyptian-text-classification"

# Load labels.json from the public repository without a token
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()

# Load model and tokenizer without a token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Simple prediction function that returns the predicted label
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
        pred_id = torch.argmax(outputs.logits, dim=1).item()
    return labels[pred_id]

# Verbose prediction with probability scores for each label (optional)
def predict_verbose(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=1).squeeze().tolist()
    for label, prob in zip(labels, probs):
        print(f"{label}: {prob:.2%}")
    return labels[torch.argmax(logits).item()]

# Example usage
text = "ุทุฒ ููŠ ุงูŠ ุญุฏ ู…ุด ุนุงุฌุจู‡ ุดุบู„ูŠ"
print("Text:", text)
print("Predicted label:", predict(text))

# To see detailed output with probabilities, uncomment:
# print("Prediction with probabilities:")
# predict_verbose(text)
```
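
For scoring many texts at once, a batched variant that moves the model to a GPU when one is available may be more convenient. This is a minimal sketch reusing the model, tokenizer, and labels loaded above; the batch size is an arbitrary choice:

```python
# Batched inference; reuses `model`, `tokenizer`, and `labels` from the example above
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def predict_batch(texts, batch_size=32):
    predictions = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True, max_length=256).to(device)
        with torch.no_grad():
            logits = model(**inputs).logits
        predictions.extend(labels[p] for p in logits.argmax(dim=1).tolist())
    return predictions
```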