🧠 Model Description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M, specialized for English to Egyptian Arabic (arz) translation. The model was trained to improve performance on informal, dialectal text, particularly in the context of spoken Egyptian Arabic.

The base model is part of the No Language Left Behind (NLLB) initiative.


💬 Intended Use

This model is intended for translating English text into Egyptian Arabic (arz), particularly:

  • Informal speech
  • Conversational and social media content
  • Spoken dialogue datasets

It is not recommended for use in formal or Modern Standard Arabic (MSA) contexts, as the output will reflect dialectal structures and vocabulary.


πŸ‹οΈ Training Details

  • Base model: facebook/nllb-200-distilled-600M
  • Target language pair: English → Egyptian Arabic (en → arz)
  • Training dataset: IbrahimAmin/arz-en-parallel-corpus
    • Includes subtitle translations, synthetic translations, and conversational Egyptian Arabic text
    • Covers both informal and semi-formal domains
  • Framework: 🤗 Transformers + PyTorch
  • Training duration: 10 epochs
  • Batch size: 12
  • Learning rate: 2e-5
  • Encoder: Frozen
  • Precision: bf16

Per-epoch training and validation loss:

Epoch   Training Loss   Validation Loss
0       No log          12.742368
1       6.736500        6.469766
2       6.328200        6.097203
3       6.004600        5.790025
4       5.745800        5.544414
5       5.537400        5.364527
6       5.339400        5.211165
7       5.224800        5.101339
8       5.131800        5.019337
9       5.076800        4.990577
10      5.059100        4.964704
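
Below is a minimal sketch of how this recipe could be reproduced with the 🤗 Seq2SeqTrainer. The dataset column names ("en" / "arz"), the train/eval split, and the preprocessing details are assumptions for illustration, not the exact training script.

from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base, src_lang="eng_Latn", tgt_lang="arz_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Freeze the encoder, as described above
for param in model.get_encoder().parameters():
    param.requires_grad = False

# Column names "en" / "arz" and the 95/5 split are assumed; check the dataset card
dataset = load_dataset("IbrahimAmin/arz-en-parallel-corpus")
splits = dataset["train"].train_test_split(test_size=0.05, seed=42)

def preprocess(batch):
    return tokenizer(batch["en"], text_target=batch["arz"], truncation=True, max_length=128)

tokenized = splits.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-600m-en-arz",
    per_device_train_batch_size=12,
    learning_rate=2e-5,
    num_train_epochs=10,
    bf16=True,
    eval_strategy="epoch",  # "evaluation_strategy" on older transformers releases
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()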

🧪 Evaluation

We evaluated the model using BLEU score on a held-out test set of English–Egyptian Arabic pairs from IbrahimAmin/arz-en-parallel-corpus.
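
For reference, the metric can be computed with the evaluate library; the predictions and references lists below are hypothetical placeholders, not released results:

import evaluate

bleu = evaluate.load("sacrebleu")

# Hypothetical placeholders: decoded model outputs and gold Egyptian Arabic
# references from the held-out test split
predictions = ["إزيك النهاردة؟"]
references = [["إزيك النهاردة؟"]]

print(bleu.compute(predictions=predictions, references=references)["score"])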

Manual inspection suggests improved handling of:

  • Idiomatic expressions
  • Spoken-style phrasing
  • Common Egyptian dialect vocabulary

📝 Example

Input (en):

How are you doing today?

Output (arz):

إزيك النهاردة؟


⚠️ Limitations

  • May hallucinate or formalize certain expressions depending on the context.
  • Trained primarily on synthetic and semi-formal sources; may not generalize well to highly domain-specific jargon.
  • Not suitable for translating into MSA or other Arabic dialects.

🚀 Usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned checkpoint in half precision and switch to inference mode
model = AutoModelForSeq2SeqLM.from_pretrained(
    "IbrahimAmin/nllb-200-distilled-600M-en-to-arz",
    torch_dtype=torch.float16,
).to(device).eval()

# NLLB tokenizers take explicit source and target language codes
tokenizer = AutoTokenizer.from_pretrained(
    "IbrahimAmin/nllb-200-distilled-600M-en-to-arz",
    src_lang="eng_Latn",
    tgt_lang="arz_Arab",
)

article = "How are you doing today?"
inputs = tokenizer(article, return_tensors="pt").to(device)

# Force the decoder to start generating in Egyptian Arabic
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("arz_Arab"),
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])

# Output: 'إزيك النهاردة؟'
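
For quick experiments, the same checkpoint can also be driven through the translation pipeline (a minimal sketch assuming a recent transformers release):

from transformers import pipeline

translator = pipeline(
    "translation",
    model="IbrahimAmin/nllb-200-distilled-600M-en-to-arz",
    src_lang="eng_Latn",
    tgt_lang="arz_Arab",
)
print(translator("How are you doing today?")[0]["translation_text"])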

✨ License

This model is distributed under the same license as the original NLLB-200 model (CC-BY-NC 4.0).
See LICENSE for details.


📝 Citation

If you use this model, please cite:

@misc{ibrahimamin2025nllb200arz,
  title={NLLB-200-600M English to Egyptian Arabic},
  author={Ibrahim Amin},
  year={2025},
  howpublished={\url{https://huggingface.co/IbrahimAmin/nllb-200-distilled-600M-en-to-arz}},
}