---
language: ar
license: mit
tags:
- arabic
- egyptian-arabic
- hate-speech
- text-classification
pipeline_tag: text-classification
base_model: aubmindlab/bert-base-arabertv2
library_name: transformers
---

# 🧠 Egyptian Dialect Text Classification by Fine-tuning AraBERT 🇪🇬

This model is a fine-tuned version of `aubmindlab/bert-base-arabertv2`, trained on Egyptian Arabic text for hate speech and offensive language classification, and reaches 91% accuracy. It classifies text into 6 categories:

| Label                    | Description              |
|--------------------------|--------------------------|
| Offensive                | Offensive text           |
| Neutral                  | Neutral text             |
| Racism                   | Racism                   |
| Sexism                   | Sexism                   |
| Religious Discrimination | Religious discrimination |
| Ads                      | Advertisements           |

---

## ⚙️ Training Details

- **Base Model:** AraBERT v2 (`aubmindlab/bert-base-arabertv2`)
- **Dataset:** Egyptian Arabic hate speech dataset with 6 labeled categories
- **Epochs:** Trained for up to 30 epochs with **early stopping** after 5 epochs without improvement
- **Regularization:** **Label smoothing** with a factor of 0.1 for better generalization
- **Best Model Selection:** Based on the **weighted-average F1 score** (a training sketch follows the metrics table below)

---

## 📊 Performance Metrics

Training was done on a GPU (`cuda`). Below is a snapshot of model performance across the first 15 epochs:

| Epoch | Train Loss | Val Loss | Accuracy | Precision | Recall | F1     |
|-------|------------|----------|----------|-----------|--------|--------|
| 1     | 1.6891     | 1.5227   | 0.4980   | 0.5027    | 0.4980 | 0.4741 |
| 2     | 1.1953     | 0.9491   | 0.7784   | 0.7883    | 0.7784 | 0.7758 |
| 3     | 0.7612     | 0.7010   | 0.8670   | 0.8693    | 0.8670 | 0.8673 |
| 4     | 0.6265     | 0.6363   | 0.9035   | 0.9042    | 0.9035 | 0.9031 |
| 5     | 0.5505     | 0.6547   | 0.8996   | 0.8995    | 0.8996 | 0.8990 |
| 6     | 0.5119     | 0.6861   | 0.8931   | 0.9018    | 0.8931 | 0.8947 |
| 7     | 0.4779     | 0.6675   | 0.9048   | 0.9066    | 0.9048 | 0.9052 |
| 8     | 0.4673     | 0.6353   | 0.9218   | 0.9238    | 0.9218 | 0.9222 |
| 9     | 0.4542     | 0.6614   | 0.9126   | 0.9136    | 0.9126 | 0.9125 |
| 10    | 0.4444     | 0.6618   | 0.9231   | 0.9238    | 0.9231 | 0.9233 |
| 11    | 0.4359     | 0.6689   | 0.9231   | 0.9235    | 0.9231 | 0.9230 |
| 12    | 0.4344     | 0.7120   | 0.9061   | 0.9097    | 0.9061 | 0.9067 |
| 13    | 0.4325     | 0.7248   | 0.9061   | 0.9105    | 0.9061 | 0.9068 |
| 14    | 0.4369     | 0.6946   | 0.9179   | 0.9221    | 0.9179 | 0.9189 |
| 15    | 0.4289     | 0.6864   | 0.9153   | 0.9171    | 0.9153 | 0.9157 |
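The exact training script is not included in this repository, but the setup described above (30 epochs with early stopping, label smoothing of 0.1, best checkpoint selected by weighted F1) maps directly onto the 🤗 `Trainer` API. The snippet below is an illustrative sketch under that assumption: the dataset is not public, so `train_ds` / `val_ds` are placeholders for your own tokenized splits, and `output_dir` is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

base_model = "aubmindlab/bert-base-arabertv2"
labels = ["Offensive", "Neutral", "Racism", "Sexism", "Religious Discrimination", "Ads"]

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=len(labels))

def compute_metrics(eval_pred):
    # Weighted-average metrics, matching the columns reported in the table above
    preds = np.argmax(eval_pred.predictions, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        eval_pred.label_ids, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(eval_pred.label_ids, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

training_args = TrainingArguments(
    output_dir="arabert-egyptian-hate-speech",  # hypothetical output directory
    num_train_epochs=30,                        # upper bound; early stopping usually ends sooner
    label_smoothing_factor=0.1,                 # label smoothing, factor 0.1
    eval_strategy="epoch",                      # use evaluation_strategy="epoch" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",                 # best checkpoint chosen by weighted F1
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,   # placeholder: your tokenized Egyptian Arabic training split
    eval_dataset=val_ds,      # placeholder: your tokenized validation split
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # stop after 5 epochs without improvement
)
trainer.train()
```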
---

## 📢 Using the Model Without a Token (For End Users)

Since the model `Woolv7007/egyptian-text-classification` and the `labels.json` file are publicly available, you can load the model, tokenizer, and labels directly without any Hugging Face token or special setup.

### How to automatically load and use the model with labels:

- Download the `labels.json` file directly from the Hugging Face Hub.
- Load the model and tokenizer without a token.
- Run prediction on your texts and map the predicted class index to its label from the labels list.
- Alternatively, fetch `labels.json` at runtime from its Hub URL:

```python
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()
```

---

### Example Python Code:

```python
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model name on Hugging Face Hub
model_name = "Woolv7007/egyptian-text-classification"

# Load labels.json from the public repository without a token
labels_url = f"https://huggingface.co/{model_name}/resolve/main/labels.json"
labels = requests.get(labels_url).json()

# Load model and tokenizer without a token
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Simple prediction function that returns the predicted label
def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    pred_id = torch.argmax(outputs.logits, dim=1).item()
    return labels[pred_id]

# Verbose prediction with probability scores for each label (optional)
def predict_verbose(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=1).squeeze().tolist()
    for label, prob in zip(labels, probs):
        print(f"{label}: {prob:.2%}")
    return labels[torch.argmax(logits).item()]

# Example usage
text = "طز في اي حد مش عاجبه شغلي"
print("Text:", text)
print("Predicted label:", predict(text))

# To see detailed output with probabilities, uncomment:
# print("Prediction with probabilities:")
# predict_verbose(text)
```
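The `predict` / `predict_verbose` functions above handle one text at a time. If you need to label many texts, you can pass a list to the tokenizer and let it pad the batch. The sketch below reuses the same public model and `labels.json`; the `predict_batch` helper and the second example sentence are only illustrative, not part of this repository.

```python
import requests
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Woolv7007/egyptian-text-classification"

# Same public artifacts as above: labels.json, model, and tokenizer, no token required
labels = requests.get(f"https://huggingface.co/{model_name}/resolve/main/labels.json").json()
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def predict_batch(texts, batch_size=16):
    """Classify a list of texts and return one predicted label per text."""
    predictions = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, padding=True, max_length=256)
        with torch.no_grad():
            logits = model(**inputs).logits
        predictions.extend(labels[i] for i in logits.argmax(dim=1).tolist())
    return predictions

# Example usage with two sentences (the second is an arbitrary illustration)
texts = [
    "طز في اي حد مش عاجبه شغلي",
    "الخدمة كانت كويسة والناس محترمة",
]
for text, label in zip(texts, predict_batch(texts)):
    print(f"{label}: {text}")
```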