# Egyptian Dialect Text Classification by Fine-tuning AraBERT 🇪🇬

This model is a fine-tuned version of `aubmindlab/bert-base-arabertv2`, trained on Egyptian Arabic text for hate speech and offensive language classification, reaching about 92% accuracy. It classifies text into 6 categories:
| Category | Label |
|---|---|
| Offensive | Offensive text |
| Neutral | Neutral text |
| Racism | Racism |
| Sexism | Sexism |
| Religious Discrimination | Religious discrimination |
| Ads | Advertisements |
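The label names are stored in a `labels.json` file in the model repository, which the example code later in this card loads. Its contents presumably mirror the table above, along these lines; treat the exact strings and ordering as an assumption and check the file itself, since the index order determines which class each logit maps to:

```python
# Presumed contents of labels.json (assumption -- verify against the actual file)
labels = ["Offensive", "Neutral", "Racism", "Sexism", "Religious Discrimination", "Ads"]
```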
## Training Details

- Base Model: AraBERT v2 (`aubmindlab/bert-base-arabertv2`)
- Dataset: Egyptian Arabic hate speech dataset with 6 labeled categories
- Epochs: trained for up to 30 epochs, with early stopping after 5 epochs without improvement
- Regularization: label smoothing with a factor of 0.1 for better generalization
- Best Model Selection: based on weighted-average F1-score
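These details map onto a standard Hugging Face `Trainer` setup. The exact training script is not published, so the following is only a minimal sketch under that assumption; `output_dir`, `train_dataset`, `val_dataset`, and `compute_metrics` are hypothetical placeholders:

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

# Assumed configuration reconstructed from the bullet points above
training_args = TrainingArguments(
    output_dir="arabert-egyptian-cls",   # placeholder output directory
    num_train_epochs=30,                 # train for up to 30 epochs
    label_smoothing_factor=0.1,          # label smoothing for better generalization
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # restore the best checkpoint at the end
    metric_for_best_model="f1",          # weighted-average F1 from compute_metrics
)

trainer = Trainer(
    model=model,                         # AutoModelForSequenceClassification with 6 labels
    args=training_args,
    train_dataset=train_dataset,         # hypothetical tokenized splits
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,     # should return a dict containing a weighted "f1"
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # stop after 5 epochs without improvement
)
trainer.train()
```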
## Performance Metrics

Training was done on a GPU (CUDA). Below is a snapshot of model performance across the first 15 epochs:
Epoch | Train Loss | Val Loss | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|---|---|
1 | 1.6891 | 1.5227 | 0.4980 | 0.5027 | 0.4980 | 0.4741 |
2 | 1.1953 | 0.9491 | 0.7784 | 0.7883 | 0.7784 | 0.7758 |
3 | 0.7612 | 0.7010 | 0.8670 | 0.8693 | 0.8670 | 0.8673 |
4 | 0.6265 | 0.6363 | 0.9035 | 0.9042 | 0.9035 | 0.9031 |
5 | 0.5505 | 0.6547 | 0.8996 | 0.8995 | 0.8996 | 0.8990 |
6 | 0.5119 | 0.6861 | 0.8931 | 0.9018 | 0.8931 | 0.8947 |
7 | 0.4779 | 0.6675 | 0.9048 | 0.9066 | 0.9048 | 0.9052 |
8 | 0.4673 | 0.6353 | 0.9218 | 0.9238 | 0.9218 | 0.9222 |
9 | 0.4542 | 0.6614 | 0.9126 | 0.9136 | 0.9126 | 0.9125 |
10 | 0.4444 | 0.6618 | 0.9231 | 0.9238 | 0.9231 | 0.9233 |
11 | 0.4359 | 0.6689 | 0.9231 | 0.9235 | 0.9231 | 0.9230 |
12 | 0.4344 | 0.7120 | 0.9061 | 0.9097 | 0.9061 | 0.9067 |
13 | 0.4325 | 0.7248 | 0.9061 | 0.9105 | 0.9061 | 0.9068 |
14 | 0.4369 | 0.6946 | 0.9179 | 0.9221 | 0.9179 | 0.9189 |
15 | 0.4289 | 0.6864 | 0.9153 | 0.9171 | 0.9153 | 0.9157 |
## Using the Model Without a Token (For End Users)

Since the model `Woolv7007/egyptian-text-classification` and its `labels.json` file are publicly available, you can load the model, tokenizer, and labels directly, without a Hugging Face token or any special setup.
How to automatically load and use the model with labels:

- Download the `labels.json` file directly from the Hugging Face Hub.
- Load the model and tokenizer without a token.
- Run prediction on your texts and map the predicted class index to its label from the labels list.
Alternatively, you can fetch `labels.json` directly by URL:

```python
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()
```
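Equivalently, the `huggingface_hub` library (installed as a dependency of `transformers`) can download and cache the file for you:

```python
import json
from huggingface_hub import hf_hub_download

# Downloads labels.json into the local Hugging Face cache and returns its path
labels_path = hf_hub_download(
    repo_id="Woolv7007/egyptian-text-classification",
    filename="labels.json",
)
with open(labels_path, encoding="utf-8") as f:
    labels = json.load(f)
```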
Example Python Code:
```python
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model name on Hugging Face Hub
model_name = "Woolv7007/egyptian-text-classification"

# Load labels.json from the public repository without a token
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()

# Load model and tokenizer without a token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Simple prediction function that returns the predicted label
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    pred_id = torch.argmax(outputs.logits, dim=1).item()
    return labels[pred_id]

# Verbose prediction with probability scores for each label (optional)
def predict_verbose(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=1).squeeze().tolist()
    for label, prob in zip(labels, probs):
        print(f"{label}: {prob:.2%}")
    return labels[torch.argmax(logits).item()]

# Example usage (Egyptian Arabic, offensive: roughly "screw anyone who doesn't like my work")
text = "طز في اي حد مش عاجبه شغلي"
print("Text:", text)
print("Predicted label:", predict(text))

# To see detailed output with probabilities, uncomment:
# print("Prediction with probabilities:")
# predict_verbose(text)
```
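For classifying many texts at once, a batched variant of `predict` (a hypothetical helper reusing the `tokenizer`, `model`, and `labels` loaded above) avoids per-text overhead:

```python
def predict_batch(texts, batch_size=32):
    """Classify a list of texts and return one predicted label per text."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, padding=True, max_length=256)
        with torch.no_grad():
            logits = model(**inputs).logits
        results.extend(labels[j] for j in torch.argmax(logits, dim=1).tolist())
    return results
```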