---
language: ar
license: mit
tags:
- arabic
- eou-detection
- turn-detection
- conversation
- smollm2
- livekit
- levantine
- egyptian
- gulf
base_model: HuggingFaceTB/SmolLM2-135M
datasets:
- Reverb/arabic-eou-conversations
metrics:
- f1
- accuracy
- precision
- recall
- auc
model-index:
- name: SmolLM2-135M-Arabic-EOU
  results:
  - task:
      type: text-classification
      name: End-of-Utterance Detection
    dataset:
      name: Arabic EOU Conversations
      type: Reverb/arabic-eou-conversations
    metrics:
    - type: f1
      value: 0.913
      name: F1 Score
    - type: accuracy
      value: 0.913
      name: Accuracy
    - type: precision
      value: 0.906
      name: Precision
    - type: recall
      value: 0.921
      name: Recall
    - type: auc
      value: 0.958
      name: AUC-ROC
---

# SmolLM2-135M Arabic End-of-Utterance Detector

Fine-tuned SmolLM2-135M model for detecting end-of-utterance (EOU) in Arabic conversations.

## Model Description

This model predicts when an Arabic speaker has finished their turn in a conversation based on transcribed speech. It's designed for real-time voice assistants, LiveKit agents, and conversational AI systems.

**Key Features:**
- 🎯 **High Accuracy**: F1-Score of 0.913
- 🌍 **Multi-Dialect**: Supports Levantine, Egyptian, and Gulf Arabic
- ⚡ **Fast Inference**: 10-20ms per prediction on GPU, 30-50ms on CPU
- 🔄 **Context-Aware**: Can use previous utterances for better predictions
- 🎙️ **Production-Ready**: Integrated with LiveKit for real-time use

## Performance

| Metric | Score |
|--------|-------|
| F1 Score | **0.913** |
| Accuracy | **0.913** |
| Precision | 0.906 |
| Recall | 0.921 |
| AUC-ROC | 0.958 |
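
The table does not state which class counts as positive or which decision threshold was used. The sketch below assumes EOU (label 1) is the positive class and argmax decoding. It shows how the threshold-based scores relate to each other, so you can recompute them on your own labeled transcripts. This is a plain-Python illustration, not the evaluation script behind this card.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary task (EOU = 1 assumed positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# F1 is the harmonic mean of precision and recall: the reported
# precision (0.906) and recall (0.921) give F1 of about 0.913, as in the table.
reported_f1 = 2 * 0.906 * 0.921 / (0.906 + 0.921)
print(round(reported_f1, 3))  # 0.913
```

AUC-ROC cannot be computed from hard labels. It needs the EOU probabilities themselves, for example via `sklearn.metrics.roc_auc_score(y_true, p_eou)`.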

**Inference Speed:**
- CPU: 30-50ms per prediction
- GPU (RTX 4070): 10-20ms per prediction
- Batch (32 samples): 3-6ms per prediction (amortized over the batch)
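
The card does not describe how these timings were measured. One way to reproduce them on your own hardware is a small timing harness like the one below. `predict` is a hypothetical placeholder for your own inference call, such as a wrapper around the Quick Start code. The median is used so that occasional slow runs do not dominate the result.

```python
import statistics
import time
from typing import Any, Callable, Sequence


def measure_latency_ms(predict: Callable[[Any], Any], inputs: Sequence[Any],
                       warmup: int = 5, runs: int = 50) -> float:
    """Median wall-clock latency of predict(x), in milliseconds.

    `predict` is a placeholder for your own inference call. PyTorch GPU work is
    asynchronous, so a GPU `predict` should call torch.cuda.synchronize()
    before returning, or the timings will be too low.
    """
    for i in range(warmup):
        predict(inputs[i % len(inputs)])  # warm up caches and kernels
    timings = []
    for i in range(runs):
        x = inputs[i % len(inputs)]
        start = time.perf_counter()
        predict(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# For batched inference, divide by the batch size to get per-prediction latency:
# per_item_ms = measure_latency_ms(predict_batch, [batch]) / len(batch)
```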

## Training Details

### Training Data

- **Dataset**: [Reverb/arabic-eou-conversations](https://huggingface.co/datasets/Reverb/arabic-eou-conversations)
- **Total Examples**: 11,660 (balanced 50/50 EOU/NOT_EOU)
- **Dialects**: 
  - Levantine (شامي)
  - Egyptian (مصري)
  - Gulf (خليجي)
- **Split**: 80% train, 10% validation, 10% test
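
The card gives the split ratios but does not say whether the split was stratified or which seed was used. The sketch below assumes a stratified split, so that each subset keeps the 50/50 EOU/NOT_EOU balance. The `text` and `label` field names and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(examples, label_key="label", ratios=(0.8, 0.1, 0.1), seed=42):
    """Split a list of dicts into train/validation/test, preserving label balance."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test


# 11,660 balanced examples means 5,830 per class (dummy records here)
data = [{"text": f"t{i}", "label": i % 2} for i in range(11_660)]
train, val, test = stratified_split(data)
print(len(train), len(val), len(test))  # 9328 1166 1166
```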

### Training Configuration

- **Base Model**: [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)
- **Parameters**: 135 million
- **Hardware**: NVIDIA RTX 4070 (8GB VRAM)
- **Batch Size**: 32 per step (effective batch size 64 via 2 gradient-accumulation steps)
- **Learning Rate**: 2e-5
- **Epochs**: 5
- **Optimizer**: AdamW
- **Mixed Precision**: FP16
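
The hyperparameters above map onto Hugging Face `TrainingArguments` fields as sketched below. `gradient_accumulation_steps=2` is inferred from the 64/32 ratio on a single GPU. The 9,328 training examples assume an exact 80% of 11,660. Settings the card does not list, such as warmup, weight decay and LR schedule, are left at their defaults here.

```python
import math

# Hyperparameters from this card, keyed by TrainingArguments field names.
# gradient_accumulation_steps=2 is inferred (64 effective / 32 per step, one GPU).
training_config = {
    "per_device_train_batch_size": 32,
    "gradient_accumulation_steps": 2,
    "learning_rate": 2e-5,
    "num_train_epochs": 5,
    "optim": "adamw_torch",
    "fp16": True,
}

num_gpus = 1  # single RTX 4070
effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"] * num_gpus)

train_examples = 9_328  # assumed: 80% of 11,660
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * training_config["num_train_epochs"]
print(effective_batch, steps_per_epoch, total_steps)  # 64 146 730
```

With `transformers` installed, the same dictionary can be passed as `TrainingArguments(output_dir="out", **training_config)`.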

## Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "Reverb/smollm2-135m-arabic-eou"
)
tokenizer = AutoTokenizer.from_pretrained(
    "Reverb/smollm2-135m-arabic-eou"
)

# Predict
text = "شو رأيك نروح نتغدا؟"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    prediction = torch.argmax(probs, dim=-1).item()
    confidence = probs[0][prediction].item()

print(f"EOU: {prediction == 1}, Confidence: {confidence:.3f}")
# Example output: EOU: True, Confidence: 0.952
```

### With Context

```python
# Using previous utterance as context
context = "كيف حالك؟"
current = "الحمد لله بخير"
text_with_context = f"{context} [SEP] {current}"

inputs = tokenizer(text_with_context, return_tensors="pt", max_length=256, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    is_eou = torch.argmax(probs, dim=-1).item() == 1
    confidence = probs[0][1 if is_eou else 0].item()

print(f"EOU: {is_eou}, Confidence: {confidence:.3f}")
```
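
When a conversation has more than one previous turn, a small helper can build the model input in the same `context [SEP] current` format used above. This is a hypothetical sketch. The card only shows a single previous utterance, so `max_context_turns > 1` is an untested assumption.

The helper trims context by whole turns rather than relying on the tokenizer. Hugging Face tokenizers truncate from the right unless `truncation_side` is set to `"left"`, and right truncation of a long input would cut off the current utterance, which is the part the EOU decision depends on.

```python
def build_eou_input(history, current, max_context_turns=1, sep=" [SEP] "):
    """Join the most recent turn(s) of `history` with `current` in the [SEP] format.

    history: earlier utterances, oldest first. max_context_turns=1 reproduces
    the single-utterance example above; larger values are an assumption.
    """
    recent = history[-max_context_turns:] if max_context_turns > 0 else []
    parts = [turn.strip() for turn in recent if turn.strip()] + [current.strip()]
    return sep.join(parts)


print(build_eou_input(["كيف حالك؟"], "الحمد لله بخير"))
```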

### Batch Prediction

```python
texts = [
    "شو رأيك",           # Partial - NOT_EOU
    "شو رأيك نروح نتغدا؟"  # Complete - EOU
]

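# Padding a batch needs a pad token; fall back to EOS if the tokenizer has none
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)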

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    predictions = torch.argmax(probs, dim=-1)

for text, pred, prob in zip(texts, predictions, probs):
    is_eou = pred.item() == 1
    conf = prob[pred].item()
    print(f"'{text}' → {'EOU' if is_eou else 'NOT_EOU'} ({conf:.3f})")
```
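
For two classes, argmax decoding is equivalent to labelling an input EOU when p(EOU) > 0.5. Raising the threshold, for example to the 0.7 used in the applications below, makes EOU predictions more conservative: it trades recall for precision. The sketch below decodes probability rows with a configurable threshold. It assumes the rows come from `probs.tolist()`. The probabilities in the usage line are illustrative, not model outputs.

```python
ID2LABEL = {0: "NOT_EOU", 1: "EOU"}  # label mapping from the Model Architecture section


def decode_with_threshold(prob_rows, threshold=0.5):
    """Map softmax rows [p_not_eou, p_eou] to (label, p_eou) pairs.

    A strict '>' comparison makes threshold=0.5 match torch.argmax exactly,
    since argmax returns index 0 (NOT_EOU) on an exact tie.
    """
    results = []
    for _, p_eou in prob_rows:
        label = ID2LABEL[1] if p_eou > threshold else ID2LABEL[0]
        results.append((label, p_eou))
    return results


rows = [[0.81, 0.19], [0.05, 0.95]]  # illustrative values
print(decode_with_threshold(rows, threshold=0.7))
# [('NOT_EOU', 0.19), ('EOU', 0.95)]
```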

## Intended Use

### Primary Use Cases

- **Voice Assistants**: Detect when users finish speaking
- **LiveKit Agents**: Real-time turn detection in voice conversations
- **Dialogue Systems**: Turn-taking in conversational AI
- **Transcription Systems**: Add turn boundaries to speech transcripts
- **Conversation Analysis**: Analyze turn-taking patterns

### Example Applications

1. **Real-time Voice Agent**
   ```python
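   # detect_eou and generate_response are placeholders: wrap the Quick Start
   # code as detect_eou(text) -> (is_eou, confidence) and plug in your own responder
   is_eou, confidence = detect_eou(transcription)  # transcription: STT output text
   if is_eou and confidence > 0.7:
       # User has finished speaking: generate the agent's reply
       agent_response = generate_response(transcription)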
   ```

2. **LiveKit Integration**
   ```python
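   from livekit_eou_sdk import ArabicEOUTurnDetector  # SDK included in this repo's files

   detector = ArabicEOUTurnDetector(threshold=0.7)

   async def on_final_transcript(text):
       # process_transcription is a coroutine; await it inside an async handler
       is_eou, conf = await detector.process_transcription(text, is_final=True)
       return is_eou, conf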
   ```
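
In a streaming agent, the STT engine emits interim and final transcripts. The agent should end the user's turn only once per utterance. The sketch below shows one way to gate on final transcripts and the 0.7 threshold used above. `TurnGate` is a hypothetical helper written for illustration. It is not the `livekit_eou_sdk` API, and the p(EOU) values in the usage line are made up.

```python
from dataclasses import dataclass


@dataclass
class TurnGate:
    """Hypothetical helper: signal end of turn at most once per utterance.

    Interim STT results are ignored; a final transcript ends the turn when
    p(EOU) exceeds the threshold.
    """
    threshold: float = 0.7
    turn_open: bool = True

    def update(self, p_eou: float, is_final: bool) -> bool:
        if not self.turn_open or not is_final:
            return False
        if p_eou > self.threshold:
            self.turn_open = False  # fire once; wait for reset() on new speech
            return True
        return False

    def reset(self) -> None:
        """Call when the user starts speaking again."""
        self.turn_open = True


gate = TurnGate()
events = [(0.40, False), (0.55, True), (0.92, True), (0.95, True)]
print([gate.update(p, final) for p, final in events])  # [False, False, True, False]
```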

## Limitations

- **Dialect Coverage**: Optimized for Levantine, Egyptian, and Gulf dialects. May not perform as well on other Arabic dialects.
- **Formal Arabic**: Designed for conversational/colloquial Arabic. Performance on Modern Standard Arabic (MSA) or Classical Arabic may vary.
- **Domain**: Trained on general conversational data. May require fine-tuning for specialized domains (medical, legal, etc.).
- **Context**: Best results when using conversation context. Single utterances without context may have lower accuracy.
- **Spoken Language**: Designed for transcribed spoken language, not written text.

## Bias and Fairness

- The model was trained on balanced data across three major Arabic dialects
- Performance is consistent across all three dialects (Levantine, Egyptian, Gulf)
- May have reduced performance on underrepresented dialects or regional variations
- No demographic or gender-based biases were intentionally introduced

## Model Architecture

- **Type**: Sequence Classification (Binary)
- **Base**: LlamaForSequenceClassification (SmolLM2-135M)
- **Input**: Arabic text (max 256 tokens)
- **Output**: Binary classification (0=NOT_EOU, 1=EOU)
- **Classes**: 2 (NOT_EOU, EOU)
- **Model Size**: ~270MB (135M parameters at 2 bytes each, i.e. 16-bit weights)
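
`LlamaForSequenceClassification` has no [CLS] token. It classifies each sequence from the hidden state at its last non-padding token, which it locates using `config.pad_token_id`. For this reason, batched inference needs a pad token id configured. The plain-Python sketch below illustrates that index lookup. It is not the library's code. Scanning from the end works for both right- and left-padded rows.

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Index of the last non-padding token in each row of token ids.

    The classification head reads the hidden state at this position.
    Rows made entirely of padding fall back to index 0.
    """
    indices = []
    for row in input_ids:
        idx = len(row) - 1
        while idx > 0 and row[idx] == pad_token_id:
            idx -= 1
        indices.append(idx)
    return indices


batch = [[5, 6, 7, 0, 0],     # right-padded
         [8, 9, 10, 11, 12],  # no padding
         [0, 0, 3, 4, 5]]     # left-padded
print(last_non_pad_index(batch, pad_token_id=0))  # [2, 4, 4]
```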

## Citation

```bibtex
@misc{arabic-eou-detector-2025,
  author = {Reverb},
  title = {Arabic End-of-Utterance Detector},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Reverb/smollm2-135m-arabic-eou}},
  note = {Fine-tuned SmolLM2-135M for Arabic EOU detection}
}
```

## License

MIT License

## Acknowledgments

- **Base Model**: [SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) by Hugging Face
- **Framework**: PyTorch, Transformers
- **Dataset**: [Arabic EOU Conversations](https://huggingface.co/datasets/Reverb/arabic-eou-conversations)

## Contact

For questions or issues, please open an issue on the [model repository](https://huggingface.co/Reverb/smollm2-135m-arabic-eou/discussions).

## Related Resources

- **Dataset**: [Reverb/arabic-eou-conversations](https://huggingface.co/datasets/Reverb/arabic-eou-conversations)
- **Code Repository**: Available in model files
- **LiveKit SDK**: Included for real-time integration

---

**Model Card Version**: 1.0  
**Last Updated**: December 2025