🧠 Model Description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M, specialized for English to Egyptian Arabic (arz) translation. The model was trained to improve performance on informal, dialectal text, particularly in the context of spoken Egyptian Arabic.

The base model is part of the No Language Left Behind (NLLB) initiative.


💬 Intended Use

This model is intended for translating English text into Egyptian Arabic (arz), particularly:

  • Informal speech
  • Conversational and social media content
  • Spoken dialogue datasets

It is not recommended for use in formal or Modern Standard Arabic (MSA) contexts, as the output will reflect dialectal structures and vocabulary.


πŸ‹οΈ Training Details

  • Base model: facebook/nllb-200-distilled-600M
  • Target language pair: English → Egyptian Arabic (en → arz)
  • Training dataset: IbrahimAmin/arz-en-parallel-corpus
    • Includes subtitle translations, synthetic translations, and conversational Egyptian Arabic text
    • Covers both informal and semi-formal domains
  • Framework: 🤗 Transformers + PyTorch
  • Training duration: 10 epochs
  • Batch size: 12
  • Learning rate: 2e-5
  • Encoder: Frozen
  • Precision: bf16

Per-epoch training and validation loss:

Epoch   Training Loss   Validation Loss
0       No log          12.742368
1       6.736500        6.469766
2       6.328200        6.097203
3       6.004600        5.790025
4       5.745800        5.544414
5       5.537400        5.364527
6       5.339400        5.211165
7       5.224800        5.101339
8       5.131800        5.019337
9       5.076800        4.990577
10      5.059100        4.964704
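
Below is a minimal sketch of how this recipe could be reproduced with the 🤗 Seq2SeqTrainer. The dataset column names ("en" / "arz"), the train/eval split, and the preprocessing details are assumptions for illustration, not the exact training script.

from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base, src_lang="eng_Latn", tgt_lang="arz_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Freeze the encoder, as described above
for param in model.get_encoder().parameters():
    param.requires_grad = False

# Column names "en" / "arz" and the 95/5 split are assumed; check the dataset card
dataset = load_dataset("IbrahimAmin/arz-en-parallel-corpus")
splits = dataset["train"].train_test_split(test_size=0.05, seed=42)

def preprocess(batch):
    return tokenizer(batch["en"], text_target=batch["arz"], truncation=True, max_length=128)

tokenized = splits.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-600m-en-arz",
    per_device_train_batch_size=12,
    learning_rate=2e-5,
    num_train_epochs=10,
    bf16=True,
    eval_strategy="epoch",  # "evaluation_strategy" on older transformers releases
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()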

🧪 Evaluation

We evaluated the model using BLEU score on a held-out test set of English–Egyptian Arabic pairs from IbrahimAmin/arz-en-parallel-corpus.
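
For reference, the metric can be computed with the evaluate library; the predictions and references lists below are hypothetical placeholders, not released results:

import evaluate

bleu = evaluate.load("sacrebleu")

# Hypothetical placeholders: decoded model outputs and gold Egyptian Arabic
# references from the held-out test split
predictions = ["إزيك النهاردة؟"]
references = [["إزيك النهاردة؟"]]

print(bleu.compute(predictions=predictions, references=references)["score"])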

Manual inspection suggests improved handling of:

  • Idiomatic expressions
  • Spoken-style phrasing
  • Common Egyptian dialect vocabulary

📝 Example

Input (en):

How are you doing today?

Output (arz):

إزيك النهاردة؟


⚠️ Limitations

  • May hallucinate or formalize certain expressions depending on the context.
  • Trained primarily on synthetic and semi-formal sources; may not generalize well to highly domain-specific jargon.
  • Not suitable for translating into MSA or other Arabic dialects.

🚀 Usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned checkpoint in half precision and switch to inference mode
model = AutoModelForSeq2SeqLM.from_pretrained(
    "IbrahimAmin/nllb-200-distilled-600M-en-to-arz",
    torch_dtype=torch.float16,
).to(device).eval()

# NLLB tokenizers take explicit source and target language codes
tokenizer = AutoTokenizer.from_pretrained(
    "IbrahimAmin/nllb-200-distilled-600M-en-to-arz",
    src_lang="eng_Latn",
    tgt_lang="arz_Arab",
)

article = "How are you doing today?"
inputs = tokenizer(article, return_tensors="pt").to(device)

# Force the decoder to start generating in Egyptian Arabic
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("arz_Arab"),
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])

# Output: 'إزيك النهاردة؟'
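
For quick experiments, the same checkpoint can also be driven through the translation pipeline (a minimal sketch assuming a recent transformers release):

from transformers import pipeline

translator = pipeline(
    "translation",
    model="IbrahimAmin/nllb-200-distilled-600M-en-to-arz",
    src_lang="eng_Latn",
    tgt_lang="arz_Arab",
)
print(translator("How are you doing today?")[0]["translation_text"])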

✨ License

This model is distributed under the same license as the original NLLB-200 model (CC-BY-NC 4.0).
See LICENSE for details.


📝 Citation

If you use this model, please cite:

@misc{ibrahimamin2025nllb200arz,
  title={NLLB-200-600M English to Egyptian Arabic},
  author={Ibrahim Amin},
  year={2025},
  howpublished={\url{https://huggingface.co/IbrahimAmin/nllb-200-distilled-600M-en-to-arz}},
}