---
language: ar
license: mit
tags:
- arabic
- eou-detection
- turn-detection
- conversation
- smollm2
- livekit
- levantine
- egyptian
- gulf
base_model: HuggingFaceTB/SmolLM2-135M
datasets:
- Reverb/arabic-eou-conversations
metrics:
- f1
- accuracy
- precision
- recall
- auc
model-index:
- name: SmolLM2-135M-Arabic-EOU
  results:
  - task:
      type: text-classification
      name: End-of-Utterance Detection
    dataset:
      name: Arabic EOU Conversations
      type: Reverb/arabic-eou-conversations
    metrics:
    - type: f1
      value: 0.913
      name: F1 Score
    - type: accuracy
      value: 0.913
      name: Accuracy
    - type: precision
      value: 0.906
      name: Precision
    - type: recall
      value: 0.921
      name: Recall
    - type: auc
      value: 0.958
      name: AUC-ROC
---

# SmolLM2-135M Arabic End-of-Utterance Detector

Fine-tuned SmolLM2-135M model for detecting end-of-utterance (EOU) in Arabic conversations.

## Model Description

This model predicts when an Arabic speaker has finished their turn in a conversation based on transcribed speech. It's designed for real-time voice assistants, LiveKit agents, and conversational AI systems.
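
As a rough sketch of how the classifier's two-way output can drive a turn-taking decision: the model emits one logit per class (0=NOT_EOU, 1=EOU, per the Model Architecture section), and a caller can respond once P(EOU) clears a threshold. The helper names below are hypothetical, and the 0.7 default only mirrors the threshold used in this card's example applications; it is not a tuned recommendation.

```python
import math
from typing import Sequence, Tuple


def eou_probability(logits: Sequence[float]) -> float:
    """Softmax over [NOT_EOU, EOU] logits; returns P(EOU)."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [NOT_EOU, EOU]")
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def should_respond(logits: Sequence[float], threshold: float = 0.7) -> Tuple[bool, float]:
    """Hypothetical gate: treat the turn as finished when P(EOU) >= threshold."""
    p_eou = eou_probability(logits)
    return p_eou >= threshold, p_eou


if __name__ == "__main__":
    # Made-up logits for illustration only, not real model output.
    done, p = should_respond([-2.0, 2.0])
    print(f"respond={done}, P(EOU)={p:.3f}")
```

With real model output, the logits for a single input could be passed as `outputs.logits[0].tolist()`. Raising the threshold trades slower responses for fewer interruptions of an unfinished turn.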

**Key Features:**
- 🎯 **High Accuracy**: F1-Score of 0.913
- 🌍 **Multi-Dialect**: Supports Levantine, Egyptian, and Gulf Arabic
- ⚡ **Fast Inference**: 10-20ms per prediction on GPU (RTX 4070), 30-50ms on CPU
- 🔄 **Context-Aware**: Can use previous utterances for better predictions
- 🎙️ **Production-Ready**: Integrated with LiveKit for real-time use

## Performance

| Metric | Score |
|--------|-------|
| F1 Score | **0.913** |
| Accuracy | **0.913** |
| Precision | 0.906 |
| Recall | 0.921 |
| AUC-ROC | 0.958 |

**Inference Speed:**
- CPU: 30-50ms per prediction
- GPU (RTX 4070): 10-20ms per prediction
- Batch (32 samples): 3-6ms per prediction

## Training Details

### Training Data

- **Dataset**: [Reverb/arabic-eou-conversations](https://huggingface.co/datasets/Reverb/arabic-eou-conversations)
- **Total Examples**: 11,660 (balanced 50/50 EOU/NOT_EOU)
- **Dialects**:
  - Levantine (شامي)
  - Egyptian (مصري)
  - Gulf (خليجي)
- **Split**: 80% train, 10% validation, 10% test

### Training Configuration

- **Base Model**: [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)
- **Parameters**: 135 million
- **Hardware**: NVIDIA RTX 4070 (8GB VRAM)
- **Batch Size**: 32 (effective: 64 with gradient accumulation)
- **Learning Rate**: 2e-5
- **Epochs**: 5
- **Optimizer**: AdamW
- **Mixed Precision**: FP16

## Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "Reverb/smollm2-135m-arabic-eou"
)
tokenizer = AutoTokenizer.from_pretrained(
    "Reverb/smollm2-135m-arabic-eou"
)

# Predict
# English gloss: "What do you think, shall we go have lunch?"
text = "شو رأيك نروح نتغدا؟"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    prediction = torch.argmax(probs, dim=-1).item()  # 0 = NOT_EOU, 1 = EOU
    confidence = probs[0][prediction].item()

print(f"EOU: {prediction == 1}, Confidence: {confidence:.3f}")
# Output: EOU: True, Confidence: 0.952
```

### With Context

```python
# Using previous utterance as context
# (reuses `model`, `tokenizer`, and `torch` from Quick Start)
# English gloss: "How are you?" / "Fine, thank God"
context = "كيف حالك؟"
current = "الحمد لله بخير"
text_with_context = f"{context} [SEP] {current}"

inputs = tokenizer(text_with_context, return_tensors="pt", max_length=256, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    is_eou = torch.argmax(probs, dim=-1).item() == 1
    confidence = probs[0][1 if is_eou else 0].item()

print(f"EOU: {is_eou}, Confidence: {confidence:.3f}")
```

### Batch Prediction

```python
# (reuses `model`, `tokenizer`, and `torch` from Quick Start)
texts = [
    "شو رأيك",  # Partial - NOT_EOU
    "شو رأيك نروح نتغدا؟"  # Complete - EOU
]

# Defensive guard, only needed if the tokenizer ships without a pad token
# (an assumption; it is a no-op when one is already configured)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    predictions = torch.argmax(probs, dim=-1)

for text, pred, prob in zip(texts, predictions, probs):
    is_eou = pred.item() == 1
    conf = prob[pred].item()
    print(f"'{text}' → {'EOU' if is_eou else 'NOT_EOU'} ({conf:.3f})")
```

## Intended Use

### Primary Use Cases

- **Voice Assistants**: Detect when users finish speaking
- **LiveKit Agents**: Real-time turn detection in voice conversations
- **Dialogue Systems**: Turn-taking in conversational AI
- **Transcription Systems**: Add turn boundaries to speech transcripts
- **Conversation Analysis**: Analyze turn-taking patterns

### Example Applications

1. **Real-time Voice Agent**
   ```python
   # Process STT transcription
   # (`detect_eou` and `generate_response` are application-defined helpers, not shown here)
   is_eou, confidence = detect_eou(transcription)
   if is_eou and confidence > 0.7:
       # User finished speaking, generate response
       agent_response = generate_response(transcription)
   ```

2. **LiveKit Integration**
   ```python
   from livekit_eou_sdk import ArabicEOUTurnDetector

   detector = ArabicEOUTurnDetector(threshold=0.7)
   # `await` must run inside an async function (e.g. a transcription handler)
   is_eou, conf = await detector.process_transcription(text, is_final=True)
   ```

## Limitations

- **Dialect Coverage**: Optimized for Levantine, Egyptian, and Gulf dialects.
  May not perform as well on other Arabic dialects.
- **Formal Arabic**: Designed for conversational/colloquial Arabic. Performance on Modern Standard Arabic (MSA) or Classical Arabic may vary.
- **Domain**: Trained on general conversational data. May require fine-tuning for specialized domains (medical, legal, etc.).
- **Context**: Best results when using conversation context. Single utterances without context may have lower accuracy.
- **Spoken Language**: Designed for transcribed spoken language, not written text.

## Bias and Fairness

- The model was trained on balanced data across three major Arabic dialects
- Performance is consistent across all three dialects (Levantine, Egyptian, Gulf)
- May have reduced performance on underrepresented dialects or regional variations
- No demographic or gender-based biases were intentionally introduced

## Model Architecture

- **Type**: Sequence Classification (Binary)
- **Base**: LlamaForSequenceClassification (SmolLM2-135M)
- **Input**: Arabic text (max 256 tokens)
- **Output**: Binary classification (0=NOT_EOU, 1=EOU)
- **Classes**: 2 (NOT_EOU, EOU)
- **Model Size**: ~270MB

## Citation

```bibtex
@misc{arabic-eou-detector-2025,
  author = {Reverb},
  title = {Arabic End-of-Utterance Detector},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Reverb/smollm2-135m-arabic-eou}},
  note = {Fine-tuned SmolLM2-135M for Arabic EOU detection}
}
```

## License

MIT License

## Acknowledgments

- **Base Model**: [SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) by Hugging Face
- **Framework**: PyTorch, Transformers
- **Dataset**: [Arabic EOU Conversations](https://huggingface.co/datasets/Reverb/arabic-eou-conversations)

## Contact

For questions or issues, please open a discussion on the [model repository](https://huggingface.co/Reverb/smollm2-135m-arabic-eou/discussions).
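
The Limitations section notes that results are best with conversation context, and the With Context example joins the previous utterance and the current one with the literal string ` [SEP] `. A minimal sketch of that bookkeeping follows. `build_input` and `LastUtteranceContext` are hypothetical names, not part of the model repository or the LiveKit SDK, and keeping only a single previous utterance is an assumption based on that example rather than a documented training format.

```python
from typing import Optional

# Separator string copied from this card's "With Context" example.
SEP = " [SEP] "


def build_input(current: str, previous: Optional[str] = None) -> str:
    """Return classifier input text, prefixed with the previous utterance if there is one."""
    current = current.strip()
    if previous and previous.strip():
        return f"{previous.strip()}{SEP}{current}"
    return current


class LastUtteranceContext:
    """Hypothetical helper that remembers the most recent completed utterance."""

    def __init__(self) -> None:
        self.previous: Optional[str] = None

    def format(self, current: str) -> str:
        """Format the in-progress transcript together with the stored context."""
        return build_input(current, self.previous)

    def commit(self, utterance: str) -> None:
        """Store a finished turn so it becomes context for the next one."""
        self.previous = utterance.strip() or None


if __name__ == "__main__":
    ctx = LastUtteranceContext()
    print(ctx.format("كيف حالك؟"))       # first turn: no context yet
    ctx.commit("كيف حالك؟")              # turn judged complete
    print(ctx.format("الحمد لله بخير"))  # next turn: previous utterance is prefixed
```

The formatted string can then be passed to the tokenizer exactly as in the With Context example, with `max_length=256` and `truncation=True`.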

## Related Resources

- **Dataset**: [Reverb/arabic-eou-conversations](https://huggingface.co/datasets/Reverb/arabic-eou-conversations)
- **Code Repository**: Available in model files
- **LiveKit SDK**: Included for real-time integration

---

**Model Card Version**: 1.0

**Last Updated**: December 2025