---
language: ar
license: mit
tags:
- arabic
- egyptian-arabic
- hate-speech
- text-classification
pipeline_tag: text-classification
base_model: aubmindlab/bert-base-arabertv2
library_name: transformers
---

# 🧠 Egyptian Dialect Text Classification by Fine-tuning AraBERT 🇪🇬

This model is a fine-tuned version of `aubmindlab/bert-base-arabertv2`, trained on Egyptian Arabic text for hate speech and offensive language classification, and reaches 91% accuracy. It classifies text into 6 categories:

| Label                    | Description              |
|--------------------------|--------------------------|
| Offensive                | Offensive text           |
| Neutral                  | Neutral text             |
| Racism                   | Racism                   |
| Sexism                   | Sexism                   |
| Religious Discrimination | Religious discrimination |
| Ads                      | Advertisements           |

---

## ⚙️ Training Details

- **Base Model:** AraBERT v2 (`aubmindlab/bert-base-arabertv2`)
- **Dataset:** Egyptian Arabic hate speech dataset with 6 labeled categories
- **Epochs:** Trained for up to 30 epochs with **early stopping** after 5 epochs without improvement
- **Regularization:** **Label smoothing** with a factor of 0.1 for better generalization
- **Best Model Selection:** Based on the **weighted-average F1 score** (a training sketch follows the metrics table below)

---

## 📊 Performance Metrics

Training was done on a GPU (`cuda`). Below is a snapshot of model performance across the first 15 epochs:

| Epoch | Train Loss | Val Loss | Accuracy | Precision | Recall | F1     |
|-------|------------|----------|----------|-----------|--------|--------|
| 1     | 1.6891     | 1.5227   | 0.4980   | 0.5027    | 0.4980 | 0.4741 |
| 2     | 1.1953     | 0.9491   | 0.7784   | 0.7883    | 0.7784 | 0.7758 |
| 3     | 0.7612     | 0.7010   | 0.8670   | 0.8693    | 0.8670 | 0.8673 |
| 4     | 0.6265     | 0.6363   | 0.9035   | 0.9042    | 0.9035 | 0.9031 |
| 5     | 0.5505     | 0.6547   | 0.8996   | 0.8995    | 0.8996 | 0.8990 |
| 6     | 0.5119     | 0.6861   | 0.8931   | 0.9018    | 0.8931 | 0.8947 |
| 7     | 0.4779     | 0.6675   | 0.9048   | 0.9066    | 0.9048 | 0.9052 |
| 8     | 0.4673     | 0.6353   | 0.9218   | 0.9238    | 0.9218 | 0.9222 |
| 9     | 0.4542     | 0.6614   | 0.9126   | 0.9136    | 0.9126 | 0.9125 |
| 10    | 0.4444     | 0.6618   | 0.9231   | 0.9238    | 0.9231 | 0.9233 |
| 11    | 0.4359     | 0.6689   | 0.9231   | 0.9235    | 0.9231 | 0.9230 |
| 12    | 0.4344     | 0.7120   | 0.9061   | 0.9097    | 0.9061 | 0.9067 |
| 13    | 0.4325     | 0.7248   | 0.9061   | 0.9105    | 0.9061 | 0.9068 |
| 14    | 0.4369     | 0.6946   | 0.9179   | 0.9221    | 0.9179 | 0.9189 |
| 15    | 0.4289     | 0.6864   | 0.9153   | 0.9171    | 0.9153 | 0.9157 |
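The exact training script is not included in this repository, but the setup described above (30 epochs with early stopping, label smoothing of 0.1, best checkpoint selected by weighted F1) maps directly onto the 🤗 `Trainer` API. The snippet below is an illustrative sketch under that assumption: the dataset is not public, so `train_ds` / `val_ds` are placeholders for your own tokenized splits, and `output_dir` is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

base_model = "aubmindlab/bert-base-arabertv2"
labels = ["Offensive", "Neutral", "Racism", "Sexism", "Religious Discrimination", "Ads"]

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=len(labels))

def compute_metrics(eval_pred):
    # Weighted-average metrics, matching the columns reported in the table above
    preds = np.argmax(eval_pred.predictions, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        eval_pred.label_ids, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(eval_pred.label_ids, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

training_args = TrainingArguments(
    output_dir="arabert-egyptian-hate-speech",  # hypothetical output directory
    num_train_epochs=30,                        # upper bound; early stopping usually ends sooner
    label_smoothing_factor=0.1,                 # label smoothing, factor 0.1
    eval_strategy="epoch",                      # use evaluation_strategy="epoch" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",                 # best checkpoint chosen by weighted F1
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,   # placeholder: your tokenized Egyptian Arabic training split
    eval_dataset=val_ds,      # placeholder: your tokenized validation split
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # stop after 5 epochs without improvement
)
trainer.train()
```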
---

## 📢 Using the Model Without a Token (For End Users)

Since the model `Woolv7007/egyptian-text-classification` and the `labels.json` file are publicly available, you can load the model, tokenizer, and labels directly without any Hugging Face token or special setup.

### How to automatically load and use the model with labels:

- Download the `labels.json` file directly from the Hugging Face Hub.
- Load the model and tokenizer without a token.
- Run prediction on your texts and map the predicted class index to its label from the labels list.
- Alternatively, fetch `labels.json` at runtime from its Hub URL:

```python
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()
```

---

### Example Python Code:

```python
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model name on Hugging Face Hub
model_name = "Woolv7007/egyptian-text-classification"

# Load labels.json from the public repository without a token
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()

# Load model and tokenizer without a token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Simple prediction function that returns the predicted label
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    pred_id = torch.argmax(outputs.logits, dim=1).item()
    return labels[pred_id]

# Verbose prediction with probability scores for each label (optional)
def predict_verbose(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=1).squeeze().tolist()
    for label, prob in zip(labels, probs):
        print(f"{label}: {prob:.2%}")
    return labels[torch.argmax(logits).item()]

# Example usage
text = "طز في اي حد مش عاجبه شغلي"
print("Text:", text)
print("Predicted label:", predict(text))

# To see detailed output with probabilities, uncomment:
# print("Prediction with probabilities:")
# predict_verbose(text)
```
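The `predict` / `predict_verbose` functions above handle one text at a time. If you need to label many texts, you can pass a list to the tokenizer and let it pad the batch. The sketch below reuses the same public model and `labels.json`; the `predict_batch` helper and the second example sentence are only illustrative, not part of this repository.

```python
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Woolv7007/egyptian-text-classification"

# Same public artifacts as above: labels.json, model, and tokenizer, no token required
labels = requests.get(f"https://huggingface.co/{model_name}/resolve/main/labels.json").json()
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def predict_batch(texts, batch_size=16):
    """Classify a list of texts and return one predicted label per text."""
    predictions = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, padding=True, max_length=256)
        with torch.no_grad():
            logits = model(**inputs).logits
        predictions.extend(labels[i] for i in logits.argmax(dim=1).tolist())
    return predictions

# Example usage with two sentences (the second is an arbitrary illustration)
texts = [
    "طز في اي حد مش عاجبه شغلي",
    "الخدمة كانت كويسة والناس محترمة",
]
for text, label in zip(texts, predict_batch(texts)):
    print(f"{label}: {text}")
```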