Model Description
This model is a fine-tuned version of facebook/nllb-200-distilled-600M, specialized for English to Egyptian Arabic (arz) translation. The model was trained to improve performance on informal, dialectal text, particularly in the context of spoken Egyptian Arabic.
The base model is part of the No Language Left Behind (NLLB) initiative.
Intended Use
This model is intended for translating English text into Egyptian Arabic (arz), particularly:
- Informal speech
- Conversational and social media content
- Spoken dialogue datasets
It is not recommended for use in formal or Modern Standard Arabic (MSA) contexts, as the output will reflect dialectal structures and vocabulary.
Training Details
- Base model: facebook/nllb-200-distilled-600M
- Target language pair: English → Egyptian Arabic (en → arz)
- Training dataset: IbrahimAmin/arz-en-parallel-corpus
  - Includes subtitle translations, synthetic translations, and conversational Egyptian Arabic text
  - Covers both informal and semi-formal domains
- Framework: 🤗 Transformers + PyTorch
- Training duration: 10 epochs
- Batch size: 12
- Learning rate: 2e-5
- Encoder: Frozen
- Precision: bf16
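The "Encoder: Frozen" setting in the list above can be sketched as follows. This is a minimal illustration, not the author's training script: `freeze_encoder` and `count_trainable` are hypothetical helper names, and the `TRAINING_KWARGS` dictionary assumes that the listed batch size is per device and that training went through the Transformers `Seq2SeqTrainingArguments` interface, neither of which the card confirms.

```python
# Hypothetical helpers. They accept any Transformers seq2seq model that exposes
# get_encoder() and parameters(), e.g. one loaded with AutoModelForSeq2SeqLM.


def freeze_encoder(model):
    """Disable gradient updates for every encoder parameter; return the model."""
    for param in model.get_encoder().parameters():
        param.requires_grad = False
    return model


def count_trainable(model):
    """Return (trainable, total) parameter counts for a quick sanity check."""
    trainable = total = 0
    for param in model.parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total


# Hyperparameters from the list above, keyed by Seq2SeqTrainingArguments field names.
# Assumption: "Batch size: 12" means 12 examples per device.
TRAINING_KWARGS = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 12,
    "learning_rate": 2e-5,
    "bf16": True,
}
```

With Transformers installed, one would load the base checkpoint with `AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")`, pass it through `freeze_encoder`, and unpack `TRAINING_KWARGS` into `Seq2SeqTrainingArguments` together with an output directory of one's choice. NLLB checkpoints typically share token embeddings between encoder and decoder, so freezing the encoder this way may also freeze those shared embeddings.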
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | No log | 12.742368 |
| 1 | 6.736500 | 6.469766 |
| 2 | 6.328200 | 6.097203 |
| 3 | 6.004600 | 5.790025 |
| 4 | 5.745800 | 5.544414 |
| 5 | 5.537400 | 5.364527 |
| 6 | 5.339400 | 5.211165 |
| 7 | 5.224800 | 5.101339 |
| 8 | 5.131800 | 5.019337 |
| 9 | 5.076800 | 4.990577 |
| 10 | 5.059100 | 4.964704 |
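One way to read the loss table is to convert the validation loss into perplexity. The sketch below assumes the reported values are mean per-token cross-entropy in nats with no label smoothing, which the card does not state; `perplexity` and `relative_drop` are names chosen for this illustration.

```python
import math

# Validation losses copied from the table above (epoch -> loss).
VAL_LOSS = {
    0: 12.742368, 1: 6.469766, 2: 6.097203, 3: 5.790025,
    4: 5.544414, 5: 5.364527, 6: 5.211165, 7: 5.101339,
    8: 5.019337, 9: 4.990577, 10: 4.964704,
}


def perplexity(loss):
    """Perplexity implied by a mean per-token cross-entropy given in nats."""
    return math.exp(loss)


def relative_drop(first, last):
    """Fractional decrease from `first` to `last`."""
    return (first - last) / first


for epoch in (1, 5, 10):
    loss = VAL_LOSS[epoch]
    print(f"epoch {epoch}: val loss {loss:.3f}, perplexity {perplexity(loss):.1f}")

drop = relative_drop(VAL_LOSS[1], VAL_LOSS[10])
print(f"validation loss fell by {drop:.1%} between epochs 1 and 10")
```

The per-epoch improvement in validation loss shrinks from about 0.37 (epoch 1 to 2) to about 0.03 (epoch 9 to 10), so most of the gain comes in the early epochs.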
Evaluation
We evaluated the model using BLEU score on a held-out test set of English–Egyptian Arabic pairs from IbrahimAmin/arz-en-parallel-corpus.
Manual inspection suggests improved handling of:
- Idiomatic expressions
- Spoken-style phrasing
- Common Egyptian dialect vocabulary
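The card does not say which BLEU implementation was used. To illustrate what the metric measures, here is a simplified pure-Python corpus BLEU (single reference, whitespace tokenization, no smoothing). The function name `corpus_bleu` and the `max_n` parameter are choices made for this sketch; established tools such as sacreBLEU apply their own tokenization and will give different numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    One reference per hypothesis, whitespace tokens, no smoothing:
    the score is 0.0 if any n-gram order has no matches.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


# Two-token sentence, so only unigrams and bigrams are scored here.
print(corpus_bleu(["إزيك النهاردة؟"], ["إزيك النهاردة؟"], max_n=2))  # 100.0 for an exact match
```

Because dialectal Arabic has no standardized spelling, a surface-overlap metric like this can penalize valid translations that merely spell a word differently, which is one reason the manual inspection noted above is a useful complement.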
Example
Input (en):
How are you doing today?
Output (arz):
إزيك النهاردة؟
Limitations
- May hallucinate or formalize certain expressions depending on the context.
- Trained primarily on synthetic and semi-formal sources; may not generalize well to highly domain-specific jargon.
- Not suitable for translating into MSA or other Arabic dialects.
Usage
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "IbrahimAmin/nllb-200-distilled-600M-en-to-arz"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Use half precision on GPU only; fall back to float32 on CPU.
dtype = torch.float16 if device != "cpu" else torch.float32

model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=dtype).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn", tgt_lang="arz_Arab")

article = "How are you doing today?"
inputs = tokenizer(article, return_tensors="pt").to(device)
# Force the decoder to start with the Egyptian Arabic language token.
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("arz_Arab"))
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
# Output: إزيك النهاردة؟
License
This model is distributed under the same license as the original NLLB-200 model (CC-BY-NC 4.0).
See LICENSE for details.
Citation
If you use this model, please cite:
@misc{ibrahimamin2025nllb200arz,
title={NLLB-200-600M English to Egyptian Arabic},
author={Ibrahim Amin},
year={2025},
howpublished={\url{https://huggingface.co/IbrahimAmin/nllb-200-distilled-600M-en-to-arz}},
}