How much did N-ATLaS-LLM move the needle? A Focused Evaluation of N-ATLaS on Yoruba, Igbo, and Hausa with AfroBench
Table of Contents
- TL;DR
- Background
- Why AfroBench? Methodology and Evaluation Framework
- Results and Discussion
- N-ATLaS Evaluation Huggingface Space
- Conclusion
TL;DR
- What we're looking at. N-ATLaS-LLM is a Llama-3-8B model fine-tuned on Nigerian languages: Yoruba, Hausa, and Igbo, alongside English. It was recently released by the Nigerian Federal Government, through the Ministry of Communications, Innovation and Digital Economy, in partnership with Awarri Technologies.
- Why we evaluated it. The model card leans on human judgments (fluency, coherence, relevance, accuracy, bias/fairness, usefulness). Useful, but not easily reproducible. We wanted replicable, apples-to-apples numbers on standardized tasks.
- Benchmark used. AfroBench-LITE, a compute-friendly slice of AfroBench spanning seven families: Intent Classification, Topic Classification, Natural Language Inference, Reading Comprehension, MMLU, Mathematical Reasoning, and Machine Translation (en↔xx).
- How we scored. Accuracy for Intent/Topic/NLI/MMLU, normalized accuracy for Reading Comprehension, exact match for Mathematical Reasoning, and chrF for Machine Translation.
- Comparison frame. We evaluated Llama-3-8B-base (the pretrained model, no instruction tuning), Llama-3-8B-Instruct (the instruction-tuned version), and N-ATLaS-LLM across various prompting strategies. Since N-ATLaS's origin (base vs Instruct) wasn't clearly documented, we tested both possibilities to establish proper baselines.
- Key Findings. Figure 1 shows the story: the most valid comparison is Llama-3-8B-Instruct zero-shot (chat) vs N-ATLaS zero-shot (chat), both instruction-tuned models with chat templates. Here, N-ATLaS shows clear improvements: Yoruba (24.4→38.4), Igbo (27.7→38.7), Hausa (28.9→44.1), and English (54.8→62.8). The real revelation? N-ATLaS with few-shot prompting substantially outperforms Llama-3-8B-Instruct few-shot: where Llama-3-8B-Instruct regresses with examples, N-ATLaS thrives, suggesting the fine-tuning improved not just language knowledge but in-context learning capability.
- Where we go next. In the detailed section we provide a task-level breakdown of the evaluation including per-sample generations showing improvements and regressions.
Figure 1. Average scores across all tasks by language. N-ATLaS increases performance for Yoruba (24.4→38.4), Igbo (27.7→38.7), Hausa (28.9→44.1), and English (54.8→62.8) compared to Llama-3-8B-Instruct zero-shot baseline.
Background
In September 2025, the Nigerian Federal Government, through the Ministry of Communications, Innovation and Digital Economy, in partnership with Awarri Technologies, released N-ATLaS-LLM, a Llama-3 8B-based model fine-tuned on approximately 392 million tokens across four languages: English, Hausa, Igbo, and Yoruba. The model card presents human evaluation results, and while those metrics are valuable, they leave open questions about how the model performs on standardized NLP benchmarks. We needed something more systematic. Something that would let us answer concrete questions: How does this model actually perform on standard NLP tasks? Where did the fine-tuning help? Where did it hurt? And critically, how does it compare to its base model, Llama-3 8B?
That's the motivation for this evaluation.
Why AfroBench? Methodology and Evaluation Framework
The Challenge of Evaluating African Language Models
Here's the thing about evaluating language models on African languages: there's a serious shortage of comprehensive, standardized benchmarks. You might get great results on English tasks, but what does that tell you about Yoruba? Or Igbo? Large-scale multilingual evaluations like MEGA often include only a handful of African languages, and when they do, the coverage is sparse at best.
AfroBench (Ojo et al., 2023) is a comprehensive evaluation framework covering 64 African languages, 15 tasks, and 22 datasets. It provides a rigorous, standardized way to measure LLM performance across the linguistic diversity of the continent.
The Compute Reality: Why AfroBench-LITE
Running the full AfroBench is computationally expensive. With 15 tasks, 3-5 prompting strategies per task, and even focusing on just our 4 languages of interest (English, Yoruba, Igbo, Hausa), we're looking at hundreds of evaluation runs, each requiring full model inference over entire test sets. AfroBench-LITE is a streamlined subset designed for evaluation under compute constraints, covering 7 datasets, 14 African languages, and 7 task families, while maintaining broad NLP coverage.
AfroBench-LITE Task Coverage
- Understanding Tasks: AfriXNLI (NLI), Belebele (Reading Comprehension), AfriMMLU (Multitask Understanding)
- Generation Tasks: African FLORES (Machine Translation), SIB (News Topic Classification), Injongo Intent (Intent Classification)
- Reasoning Tasks: AfriMGSM (Mathematical Reasoning)
Our Evaluation Setup
We made a few strategic decisions about how to run these evaluations:
- Proper Baseline Comparison: The N-ATLaS model card does not clearly state whether the checkpoint was fine-tuned from Llama-3-8B-base or Llama-3-8B-Instruct. We assumed an Instruct origin, but to cover the full design space we also evaluated Llama-3-8B-base (zero-shot, no chat template) alongside Llama-3-8B-Instruct (zero- and few-shot; k=5 for all tasks except AfriMGSM, where k=7). This would show us exactly what the fine-tuning was achieving: not just how N-ATLaS-LLM performs at some absolute level, but how much it improved (or regressed) relative to its plausible starting points.
- Language-Specific Granularity: We decided to be more granular than the original AfroBench-LITE reporting. The benchmark typically reports results as "English" vs "African languages" in aggregate. That makes sense when you're covering 64 or 14 languages, but we're specifically interested in the four languages N-ATLaS was trained on: English, Yoruba, Igbo, and Hausa. So we broke out the results by language. This matters. Treating all African languages as a monolith obscures important differences. Hausa, Igbo, and Yoruba have different linguistic structures, different amounts of training data available, and potentially different fine-tuning data quality in the N-ATLaS training set.
- Bidirectional Machine Translation: For the Machine Translation task, we expanded the evaluation slightly. AfroBench-LITE only includes English-to-XX translation (en→xx), but the full AfroBench includes both directions. Since translation quality can vary significantly by direction, and translating into low-resource languages is often harder than translating from them, we included both:
- en→xx: English to African languages
- xx→en: African languages to English
Evaluation Metrics
We used the lm-evaluation-harness (EleutherAI) framework to run these evaluations. This is the standard tool for LLM benchmarking, ensuring reproducibility and consistency. For metrics, we stuck with what AfroBench established as primary for each task:
| Task | Metric | Description |
|---|---|---|
| Intent Classification | Accuracy | % correctly classified intents |
| Topic Classification (TC) | Accuracy | % correctly classified news topics |
| Natural Language Inference (NLI) | Accuracy | % correct entailment/contradiction/neutral |
| MMLU | Accuracy | % correct answers on multitask understanding |
| Reading Comprehension (RC) | Normalized Accuracy | Accuracy with answer log-likelihoods normalized by answer length |
| Mathematics (MGSM) | Exact Match | Flexible extraction matching for numerical answers |
| Machine Translation (MT) | chrF | Character n-gram F-score |
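For those who want to reproduce or extend these runs, here is a minimal sketch of how a single configuration can be launched through the harness's Python API. The task names below are placeholders (the exact AfroBench task IDs and some argument names vary across harness versions), so treat this as a starting point rather than our exact command:

```python
# Minimal sketch of one evaluation run with lm-evaluation-harness (EleutherAI).
# Task IDs are placeholders; exact AfroBench task names and some argument
# names differ across harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16",
    tasks=["afrimmlu_yor", "afrixnli_yor", "belebele_yor"],  # hypothetical IDs
    num_fewshot=5,             # 0 for zero-shot runs, 7 for AfriMGSM
    apply_chat_template=True,  # the "chat" configurations in our tables
    batch_size=8,
)

# Per-task metrics (accuracy, acc_norm, exact_match, chrF, ...) live here:
for task, metrics in results["results"].items():
    print(task, metrics)
```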
Results and Discussion
Overall Performance
Let's start with the big picture. Here's how the models performed across all tasks, averaged by language:
| Language | LLaMA-3-8B (no chat) | LLaMA-3-8B-Instruct (0-shot, chat) | LLaMA-3-8B-Instruct (5-shot*, chat) | N-ATLaS (0-shot, no chat) | N-ATLaS (0-shot, chat) | N-ATLaS (5-shot*, chat) |
|---|---|---|---|---|---|---|
| English | 36.5 | 54.8 | 45.1 | 36.6 | 62.8 | 72.1 |
| Yoruba | 22.7 | 24.4 | 24.0 | 27.1 | 38.4 | 47.9 |
| Igbo | 23.6 | 27.7 | 25.6 | 27.2 | 38.7 | 49.3 |
| Hausa | 24.7 | 28.9 | 25.8 | 28.5 | 44.1 | 54.6 |
Table 1. Average accuracy scores across all tasks by language and model configuration. N-ATLaS shows consistent improvements over Llama-3-8B-Instruct, with the largest gains appearing in few-shot settings.
*Note: k=5 for most tasks, k=7 for AfriMGSM
*"chat" means the prompt was formatted with the model's chat template; "no chat" means it was not.
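To make the "chat" and few-shot columns concrete, here is a rough sketch of how a few-shot chat prompt can be assembled with the transformers tokenizer. The instruction wording and example turns are illustrative, not the exact AfroBench prompt templates, and the harness may pack the examples differently (e.g. into a single user turn):

```python
# Sketch of what "with chat template, few-shot" means in practice.
# Wording is illustrative; not the exact AfroBench prompts.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    # In-context examples shown here as prior turns (2-shot for brevity;
    # our runs used k=5, or k=7 for AfriMGSM).
    {"role": "user", "content": "Classify the topic of this Yoruba sentence: <example 1>"},
    {"role": "assistant", "content": "sports"},
    {"role": "user", "content": "Classify the topic of this Yoruba sentence: <example 2>"},
    {"role": "assistant", "content": "politics"},
    # The actual test instance.
    {"role": "user", "content": "Classify the topic of this Yoruba sentence: <test sentence>"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# "No chat" runs skip apply_chat_template and feed a plain text prompt instead.
print(prompt)
```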
The Fairest Comparison: Instruct-to-Instruct
The most meaningful comparison is Llama-3-8B-Instruct zero-shot (chat) versus N-ATLaS zero-shot (chat)—both are instruction-tuned models evaluated with chat templates. Here's what the fine-tuning achieved:
- English: 54.8 → 62.8 (+8.0 points, +15% relative)
- Yoruba: 24.4 → 38.4 (+14.0 points, +57% relative)
- Igbo: 27.7 → 38.7 (+11.0 points, +40% relative)
- Hausa: 28.9 → 44.1 (+15.2 points, +53% relative)
These are substantial improvements, especially for the Nigerian languages. The fine-tuning didn't just help; it lifted Yoruba and Hausa by more than 50% in relative terms.
The Few-Shot Paradox
Here's where things get interesting. Llama-3-8B-Instruct regresses with few-shot examples across all languages: English drops from 54.8 to 45.1, and the Nigerian languages drop 0.4-3.1 points. But N-ATLaS thrives with examples:
- English: 62.8 → 72.1 (+9.3 points)
- Yoruba: 38.4 → 47.9 (+9.5 points)
- Igbo: 38.7 → 49.3 (+10.6 points)
- Hausa: 44.1 → 54.6 (+10.5 points)
N-ATLaS did not wear the same outfit. While Llama-Instruct struggles with in-context examples, N-ATLaS consistently improves. This suggests the fine-tuning enhanced not just language-specific knowledge but the model's ability to learn from demonstrations, an improvement that compounds the direct gains.
Why Base Model Comparisons Matter Less
We included Llama-3-8B-base (no chat) evaluations to cover all possibilities since N-ATLaS's origin wasn't clearly stated. Base-to-base comparisons (both without chat templates) show modest improvements: Nigerian languages gain 3-4 points, English stays flat. But this comparison has limited value: it mixes the effects of instruction tuning with language-specific fine-tuning. The instruct-to-instruct comparison isolates what the Nigerian language fine-tuning actually achieved.
The Persistent Gap
Even with strong improvements, substantial gaps remain between English and Nigerian languages. At best (few-shot, chat), English reaches 72.1 while Hausa tops out at 54.6—a 17.5 point difference (24% lower). This reflects Llama-3's English-dominant pretraining: decades of English NLP research, vastly more training data, and more mature benchmarks. Fine-tuning on 392 million tokens narrowed the gap but couldn't close it.
That gap represents millions of people who still can't fully access AI in their native languages.
Task-by-Task Breakdown
The overall numbers tell one story, but the task-level breakdown reveals where N-ATLaS actually moved the needle and where it stumbled. Figure 2 shows radar plots for each language, breaking down performance across all seven task families.
What the Radar Plots Reveal
For Yoruba, Igbo, and Hausa: The brownish area (N-ATLaS few-shot) consistently expands beyond all other configurations across nearly all tasks. The most dramatic expansions appear in Intent Classification (INTENT), Machine Translation (MT en→xx and xx→en), Topic Classification (TC), and Mathematical Reasoning (MATH). These aren't just incremental improvements; they're substantial capability gains that change what the model can realistically be used for.
The fairest comparison, Llama-3-8B-Instruct zero-shot (purple) versus N-ATLaS zero-shot (green), shows consistent N-ATLaS advantages across all tasks. Natural Language Inference (NLI) and MMLU demonstrate more modest but steady gains, while other tasks show solid improvements.
For English: The brownish area (N-ATLaS few-shot) still outperforms all other configurations on most tasks. However, comparing the fairest baseline, Llama-3-8B-Instruct zero-shot (purple) versus N-ATLaS zero-shot (green), reveals some nuance. While N-ATLaS leads on most tasks, Llama-3-8B-Instruct zero-shot edges ahead on Mathematical Reasoning (MATH) and Intent Classification (INTENT). This suggests the fine-tuning made some trade-offs on English performance, though the net result with few-shot prompting still strongly favors N-ATLaS.
Now let's dig into the specific tasks and see what stories the numbers are telling.
Figure 2. Task-level performance by language (same radial scale across panels). Brownish area shows N-ATLaS few-shot; green shows N-ATLaS zero-shot; purple shows Llama-3-8B-Instruct zero-shot. N-ATLaS few-shot dominates across all tasks for all languages, and N-ATLaS zero-shot dominates across all tasks for the Nigerian languages, with the largest gains in Mathematical Reasoning, Topic Classification, and Machine Translation. For English, N-ATLaS zero-shot leads on most tasks, though Llama-3-8B-Instruct zero-shot edges ahead on Intent Classification and Mathematical Reasoning.
Intent Classification (Injongo)
| Language | LLaMA-3-8B (0-shot, no chat) | LLaMA-3-8B-Instruct (0-shot, chat) | LLaMA-3-8B-Instruct (5-shot*, chat) | N-ATLaS (0-shot, no chat) | N-ATLaS (0-shot, chat) | N-ATLaS (5-shot*, chat) |
|---|---|---|---|---|---|---|
| English | 2.7 | 70.3 | 5.0 | 5.5 | 62.6 | 78.8 |
| Yoruba | 2.5 | 13.6 | 2.6 | 2.8 | 25.8 | 57.4 |
| Igbo | 2.5 | 18.6 | 2.8 | 3.5 | 28.7 | 62.8 |
| Hausa | 2.5 | 18.2 | 2.8 | 3.1 | 39.2 | 70.0 |
Table 2. Intent Classification (Injongo) accuracy scores. Llama-3-8B-Instruct collapses in few-shot settings while N-ATLaS thrives, particularly for Nigerian languages.
This is where the few-shot puzzle becomes most obvious and most dramatic.
Figure 3. Intent Classification (Injongo) accuracy by language and model configuration.
The Collapse of Llama-3-8B-Instruct in Few-Shot
Look at English first: Llama-3-8B-Instruct zero-shot achieves 70.3% accuracy. Respectable. But give it 5 examples and it collapses to 5.0%—a catastrophic drop of 65 points. The Nigerian languages follow a similar pattern: Yoruba drops from 13.6% to 2.6%, Igbo from 18.6% to 2.8%, Hausa from 18.2% to 2.8%. Across the board, few-shot examples break the model.
Why? Intent classification in Injongo involves roughly 40 different intent classes. That's a lot of options. The evaluation uses log-likelihood scoring over these 40 classes, and when you throw 5 examples into the context, it seems like Llama-3-8B-Instruct just... gets confused. Maybe the long multi-example prompt pushes it into a strange corner of its learned distribution. Maybe it learned patterns during instruction tuning that conflict with this kind of dense, multi-class in-context learning.
N-ATLaS Did Not Wear the Same Outfit
But here's what's fascinating: N-ATLaS doesn't break. In fact, it gets dramatically better with examples.
- English: 62.6% (zero-shot) → 78.8% (few-shot), a +16.2 point gain
- Yoruba: 25.8% → 57.4% (+31.6 points!)
- Igbo: 28.7% → 62.8% (+34.1 points)
- Hausa: 39.2% → 70.0% (+30.8 points)
These are massive improvements. For Yoruba, Igbo, and Hausa, few-shot prompting more than doubles performance. This completely flips the script—if the evaluation mechanism (40-class log-likelihood with 5-shot context) were fundamentally problematic, N-ATLaS would fail too. It doesn't. It thrives.
What This Actually Means
We initially wondered if the few-shot regression was a context length issue or some artifact of the evaluation setup. But N-ATLaS proves that theory wrong. The fine-tuning must have done something to make the model more robust to multi-example prompts with many output classes. Maybe the training data included similar patterns. Maybe the model learned better representations that don't get destabilized by long contexts with multiple label options.
Whatever the reason, the takeaway is clear: N-ATLaS turned intent classification from a weakness into a relative strength, especially for the Nigerian languages. Hausa's 70.0% in few-shot is approaching the English Llama-3-8B-Instruct zero-shot (70.3%), which is frankly impressive for a low-resource language.
A Note on Evaluation Methodology
That said, we should probably talk about the elephant in this particular room: is log-likelihood over 40 classes really the best way to evaluate intent classification?
To clarify: the evaluation uses log-likelihood scoring, where the model assigns a probability to each of the ~40 intent classes, and we select the class with the highest probability. Specifically, for each intent class $c$, we compute the log-likelihood the model assigns to that class label given the input utterance $x$:

$$s(c) = \log P_\theta(c \mid x)$$

The predicted intent is the highest-scoring class:

$$\hat{c} = \arg\max_{c} s(c)$$

This approach is standard in lm-evaluation-harness for classification tasks: it's computationally efficient and directly measures the model's internal confidence.
But with 40 options, small probability differences between plausible intents can swing the prediction, and models might be getting "close enough" answers that are marked as completely wrong. An LLM-as-judge approach or even using semantic similarity to rank intents might give us a more nuanced picture of what these models actually understand. But that's a conversation for another day and a better evaluation framework.
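For readers who want to see the mechanics, here is a simplified sketch of log-likelihood classification with transformers. The prompt and label set are made up, and real harness implementations handle tokenization boundaries more carefully, but the core idea, score every label and take the argmax, is the same:

```python
# Simplified sketch of log-likelihood classification (the scoring scheme
# described above). The prompt and label set here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Utterance: how do I transfer money to my savings account?\nIntent:"
labels = ["transfer", "balance", "bill_pay", "freeze_account"]  # Injongo has ~40

def label_loglikelihood(prompt: str, label: str) -> float:
    """Sum of log-probabilities of the label tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + label, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    label_tokens = full_ids[0, prompt_len:]
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    # Real implementations align prompt/label tokenization more carefully.
    return sum(log_probs[p, t].item() for p, t in zip(positions, label_tokens))

scores = {c: label_loglikelihood(prompt, c) for c in labels}
prediction = max(scores, key=scores.get)  # argmax over the class scores
# "Normalized accuracy" (e.g. for Belebele) additionally divides each score
# by the length of the answer option before taking the argmax.
print(prediction)
```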
Topic Classification (SIB)
| Language | LLaMA-3-8B (no chat) | LLaMA-3-8B-Instruct (0-shot, chat) | LLaMA-3-8B-Instruct (5-shot*, chat) | N-ATLaS (0-shot, no chat) | N-ATLaS (0-shot, chat) | N-ATLaS (5-shot*, chat) |
|---|---|---|---|---|---|---|
| English | 63.9 | 74.1 | 74.2 | 47.6 | 82.9 | 85.0 |
| Yoruba | 36.9 | 23.7 | 38.5 | 43.8 | 64.9 | 74.1 |
| Igbo | 43.0 | 37.2 | 48.4 | 45.1 | 69.1 | 76.9 |
| Hausa | 43.7 | 41.5 | 48.8 | 45.7 | 71.8 | 80.5 |
Table 3. Topic Classification (SIB) accuracy scores. N-ATLaS demonstrates strong improvements across all languages, with Nigerian languages approaching English performance in few-shot settings.
Figure 4. Topic Classification (SIB) accuracy by language and model configuration.
The Instruction Effect
With chat templates, Llama-3-8B-Instruct (zero-shot) performs decently on English at 74.1, but is notably weaker on the Nigerian languages (23.7 Yoruba, 37.2 Igbo, 41.5 Hausa). Interestingly, few-shot helps Llama-3-8B-Instruct on the Nigerian languages: Yoruba jumps to 38.5, Igbo to 48.4, Hausa to 48.8, while English stays basically flat at 74.2.
N-ATLaS with chat templates, though, is where the real action is. Zero-shot, it hits 82.9 on English (beating Llama-3-8B-Instruct by 8.8 points) and shows strong gains on Nigerian languages: Yoruba 64.9, Igbo 69.1, Hausa 71.8. With few-shot examples, N-ATLaS climbs even higher: English reaches 85.0, and the Nigerian languages hit 74.1 (Yoruba), 76.9 (Igbo), and 80.5 (Hausa).
What's Happening Here
Topic classification is more structured than intent classification: you're categorizing news articles into topics like politics, sports, entertainment, etc. There are far fewer classes (just 7), and the task is more about understanding the semantic gist of a passage than picking from a huge label space.
N-ATLaS clearly learned something useful about how to approach this task during fine-tuning, and it translates well across languages. The fact that Hausa few-shot (80.5%) nearly matches English few-shot (85.0%) is remarkable: a gap of only 4.5 points, suggesting the model has developed reasonably language-agnostic topic classification capabilities.
The English regression without chat templates (47.6 vs 63.9 base) is worth noting though. It suggests that fine-tuning might have shifted the model's "default" mode to expect certain kinds of formatting or structure. Once you add chat templates, that issue disappears and N-ATLaS actually exceeds the base model. This is a common pattern in fine-tuning: you sometimes trade raw zero-shot performance for better structured-prompt performance.
One More Thing
Unlike intent classification, Llama-3-8B-Instruct doesn't collapse in few-shot here. It actually improves slightly. This reinforces our earlier point: the intent classification few-shot breakdown wasn't about context length or few-shot prompting in general—it was something specific to that task's structure. Topic classification has fewer classes and different input formats, and that seems to make the difference.
Natural Language Inference (AfriXNLI)
| Language | LLaMA-3-8B (no chat) | LLaMA-3-8B-Instruct (0-shot, chat) | LLaMA-3-8B-Instruct (5-shot*, chat) | N-ATLaS (0-shot, no chat) | N-ATLaS (0-shot, chat) | N-ATLaS (5-shot*, chat) |
|---|---|---|---|---|---|---|
| English | 38.4 | 45.9 | 47.1 | 40.6 | 49.7 | 55.5 |
| Yoruba | 34.2 | 36.6 | 37.5 | 35.9 | 39.4 | 45.0 |
| Igbo | 34.4 | 37.8 | 38.8 | 35.5 | 39.6 | 46.0 |
| Hausa | 35.1 | 37.8 | 39.0 | 36.3 | 41.7 | 47.7 |
Table 4. Natural Language Inference (AfriXNLI) accuracy scores. Modest but consistent improvements across all languages, with both models benefiting from few-shot examples.
Natural Language Inference is one of those tasks that looks simple but is actually quite hard. You get two sentences, a premise and a hypothesis, and you have to decide: does the premise entail the hypothesis (yes), contradict it (no), or neither (neutral)? It requires understanding subtle semantic relationships, not just keyword matching.
Figure 5. Natural Language Inference (AfriXNLI) accuracy by language and model configuration.
Consistent, Modest Improvements
The pattern here is remarkably consistent: N-ATLaS improves over Llama-3-8B across all languages and all configurations, but the gains are modest, on the order of 2-8 points depending on the setup.
With chat templates (zero-shot), N-ATLaS edges ahead: English 49.7 vs 45.9 (+3.8), Yoruba 39.4 vs 36.6 (+2.8), Igbo 39.6 vs 37.8 (+1.8), Hausa 41.7 vs 37.8 (+3.9).
Few-shot examples help both models, but N-ATLaS gains more. At few-shot with chat, N-ATLaS reaches 55.5 (English), 45.0 (Yoruba), 46.0 (Igbo), and 47.7 (Hausa), compared to Llama-3-8B-Instruct's 47.1, 37.5, 38.8, and 39.0 respectively. That's roughly 7-8 point improvements across the board.
Why So Modest?
NLI is fundamentally about logical reasoning over language. The task doesn't depend as heavily on having seen similar examples in training—it's more about core language understanding and reasoning capabilities.
Fine-tuning on more Yoruba, Igbo, and Hausa text helps, but it's not going to transform the model's reasoning abilities overnight.
The improvements we're seeing are probably coming from two places: (1) better language-specific representations that help the model parse the premise and hypothesis correctly, and (2) possibly some exposure to similar reasoning patterns in the fine-tuning data. But NLI is hard, and even state-of-the-art English models don't score dramatically higher on XNLI variants.
One thing that stands out: the gap between English and the Nigerian languages is relatively small here compared to other tasks. And unlike intent classification, few-shot prompting consistently helps both models on NLI. Llama-3-8B-Instruct gains about a point from zero- to few-shot, and N-ATLaS gains 5-6 points.
Reading Comprehension (Belebele)
| Language | LLaMA-3-8B (no chat) | LLaMA-3-8B-Instruct (0-shot, chat) | LLaMA-3-8B-Instruct (5-shot*, chat) | N-ATLaS (0-shot, no chat) | N-ATLaS (0-shot, chat) | N-ATLaS (5-shot*, chat) |
|---|---|---|---|---|---|---|
| English | 39.4 | 30.8 | 27.0 | 33.4 | 68.4 | 81.2 |
| Yoruba | 24.1 | 22.9 | 22.9 | 25.7 | 35.8 | 39.4 |
| Igbo | 25.3 | 23.0 | 22.9 | 25.6 | 36.5 | 42.5 |
| Hausa | 27.1 | 23.2 | 22.9 | 27.4 | 43.7 | 51.4 |
Table 5. Reading Comprehension (Belebele) normalized accuracy scores. N-ATLaS shows dramatic improvements especially for English, while Llama-3-8B-Instruct regresses with few-shot prompting.
Reading comprehension is where Llama-3-8B-Instruct has one of its worst showings.
Figure 6. Reading Comprehension (Belebele) normalized accuracy by language and model configuration.
Llama-3-8B-Instruct's Reading Comprehension Problem
Llama-3-8B base scores 39.4 on English, not great, but reasonable. But with the instruct model and chat template? It drops to 30.8 zero-shot and 27.0 few-shot, more than 12 points below the base model. The Nigerian languages hover around 22-23% across all Llama-3-8B-Instruct configs, at or below random guessing (25% for a 4-option multiple choice).
N-ATLaS's Massive Leap
N-ATLaS, on the other hand, shows one of its biggest wins here. Without chat templates, it's similar to base Llama-3-8B: 33.4 (English), 25.7 (Yoruba), 25.6 (Igbo), 27.4 (Hausa). But turn on the chat template and suddenly English jumps to 68.4 zero-shot and 81.2 few-shot, more than doubling from the no-chat baseline.
The Nigerian languages also see significant gains: Hausa reaches 51.4 few-shot (nearly double its baseline), Igbo hits 42.5, and Yoruba reaches 39.4. These aren't record-breaking numbers in absolute terms, but relative to where we started, they're substantial.
MMLU
| Language | LLaMA-3-8B (no chat) | LLaMA-3-8B-Instruct (0-shot, chat) | LLaMA-3-8B-Instruct (5-shot*, chat) | N-ATLaS (0-shot, no chat) | N-ATLaS (0-shot, chat) | N-ATLaS (5-shot*, chat) |
|---|---|---|---|---|---|---|
| English | 50.4 | 41.0 | 49.1 | 49.6 | 52.0 | 57.3 |
| Yoruba | 30.6 | 29.3 | 33.3 | 31.9 | 33.5 | 36.3 |
| Igbo | 30.7 | 28.5 | 32.6 | 33.3 | 33.7 | 37.0 |
| Hausa | 30.2 | 27.6 | 30.2 | 32.7 | 34.3 | 37.8 |
Table 6. Multitask Language Understanding (AfriMMLU) accuracy scores. Steady improvements across languages and model configurations.
MMLU is arguably one of the most comprehensive benchmarks we have. It spans 57 subjects: everything from high school math and history to professional law and medicine. It tests not just language understanding but world knowledge across domains.
Steady, Modest Improvements
The story here is consistent but not spectacular. N-ATLaS improves across the board, but we're talking 2-8 point gains depending on the configuration.
Base comparison (no chat): English essentially flat (50.4 → 49.6), but Nigerian languages show small gains—Yoruba +1.3, Igbo +2.6, Hausa +2.5.
With chat templates (zero-shot): English improves markedly over Llama-3-8B-Instruct (52.0 vs 41.0), and the Nigerian languages show 4-7 point improvements over Llama-3-8B-Instruct.
Few-shot tells a similar story: N-ATLaS edges ahead consistently. English reaches 57.3, and the Nigerian languages cluster around 36-38%.
Mathematical Reasoning (AfriMGSM)
| Language | LLaMA-3-8B (no chat) | LLaMA-3-8B-Instruct (0-shot, chat) | LLaMA-3-8B-Instruct (7-shot*, chat) | N-ATLaS (0-shot, no chat) | N-ATLaS (0-shot, chat) | N-ATLaS (7-shot*, chat) |
|---|---|---|---|---|---|---|
| English | 24.1 | 66.5 | 68.3 | 43.0 | 60.9 | 75.0 |
| Yoruba | 7.7 | 20.0 | 9.0 | 22.4 | 31.2 | 35.0 |
| Igbo | 5.4 | 21.0 | 8.2 | 20.2 | 24.6 | 30.6 |
| Hausa | 9.3 | 25.1 | 10.8 | 25.9 | 34.2 | 40.4 |
Table 7. Mathematical Reasoning (AfriMGSM) exact match (flexible-extract) scores. N-ATLaS shows substantial gains while Llama-3-8B-Instruct collapses in few-shot settings for Nigerian languages.
Math is where N-ATLaS shows some of its most impressive gains and where Llama-3-8B-Instruct reveals another dramatic few-shot failure.
Figure 8. Mathematical Reasoning (AfriMGSM) exact match (flexible-extract) scores by language and model configuration.
The Few-Shot Collapse (Again)
Look at Llama-3-8B-Instruct: zero-shot with chat, it scores 66.5% on English, 20-25% on Nigerian languages. Respectable. Add 7-shot examples (we used 7 for math instead of 5), and English barely budges (68.3%), but the Nigerian languages collapse: Yoruba drops to 9.0%, Igbo to 8.2%, Hausa to 10.8%.
This is the intent classification problem all over again, but even worse. Math problems require multi-step reasoning, and with 7 worked examples in the context, Llama-3-8B-Instruct just... breaks. Maybe the long examples push past some context threshold. Maybe the model learned patterns during instruction tuning that conflict with chain-of-thought prompting in low-resource languages.
N-ATLaS Climbs Where Others Fall
N-ATLaS, meanwhile, shows consistent improvement with examples:
- English: 43.0% (no chat) → 60.9% (0-shot, chat) → 75.0% (few-shot, chat)
- Yoruba: 22.4% → 31.2% → 35.0%
- Igbo: 20.2% → 24.6% → 30.6%
- Hausa: 25.9% → 34.2% → 40.4%
These are substantial relative gains. Yoruba improves 57% from no-chat to few-shot. Hausa improves 56%. Even English, which started with reasonable performance, jumps to 75.0%—higher than Llama-3-8B-Instruct's 68.3%.
Why Math Matters
Mathematical reasoning is special. It's not just about vocabulary or cultural knowledge—it's about logical thinking, following multi-step procedures, and extracting numerical answers correctly. Math word problems in low-resource languages are particularly challenging because you need to:
- Parse the problem statement correctly (language understanding)
- Identify the mathematical operations needed (reasoning)
- Perform the calculations (which should be language-agnostic, but formatting matters)
- Extract and format the answer appropriately
If any step fails, you get the wrong answer. The fact that N-ATLaS can push Hausa to 40.4% and Yoruba to 35.0% suggests the fine-tuning included enough mathematical text in these languages to build those capabilities. That's actually really valuable—math education in native languages is a huge practical application.
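Step 4, extracting the answer, is also where the "flexible-extract" exact match comes in. A rough approximation of what that metric does is pull the last number out of the generation and compare it to the gold answer; the actual harness filter is regex-based and configured per task, so the sketch below is a simplification:

```python
# Rough approximation of "flexible-extract" exact match for AfriMGSM-style
# answers: take the last number in the generation and compare to the gold
# answer. The real harness filter is regex-based but task-specific.
import re

def extract_last_number(generation: str) -> str | None:
    # Matches integers and decimals, tolerating thousands separators.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generation)
    return numbers[-1].replace(",", "") if numbers else None

def flexible_exact_match(generation: str, gold: str) -> bool:
    predicted = extract_last_number(generation)
    return predicted is not None and float(predicted) == float(gold)

# The N-ATLaS English generation below ends "... 60 * 9 = 540 meters a week."
print(flexible_exact_match("so he runs 60 * 9 = 540 meters a week.", "540"))  # True
```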
What the Models Actually Generate
Numbers tell part of the story, but looking at actual model outputs reveals the qualitative differences. Here's the same word problem across all four languages, comparing Llama-3-8B-Instruct (zero-shot, chat) with N-ATLaS (zero-shot, chat):
Question (English): James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?
N-ATLaS (English):
Step 1: Determine the number of sprints James runs in a week. James runs 3 sprints 3 times a week, so he runs 3 * 3 = 9 sprints a week.
Step 2: Calculate the total distance James runs in a week. James runs 60 meters each sprint, so he runs 60 * 9 = 540 meters a week.
Llama-3-8B-Instruct (English):
Let's break it down step by step:
James runs 3 sprints 3 times a week, so he runs a total of: 3 sprints/week × 3 times/week = 9 sprints/week
Each sprint is 60 meters, so the total distance he runs in each sprint is: 9 sprints/week × 60 meters/sprint = 540 meters/week
Therefore, James runs a total of 540 meters a week.
For English, both models arrive at the correct answer (540 meters) with clear reasoning. The structure is nearly identical. This is expected; both are strong English models.
Question (Yoruba): James pinnu láti sá eré ọ̀gbọrangandan 3 ní ìgbà 3 lọ́sẹ̀. Ó sá mítà 60 ní ọ̀gbọrangandan kọ̀ọ̀kan. Iye àpapọ̀ mítà mélòó ni ó máa ń sá lọ́sẹ̀?
N-ATLaS (Yoruba):
Ní ọ̀gbọ̀n ọjọ́, Jákọ́bù máa ń sá 60 * 3 = 180 mítà. Nítorí náà, ó sá 180 * 3 = 540 mítà lọ́sẹ̀.
[Translation: In a day, James runs 60 * 3 = 180 meters. Therefore, he runs 180 * 3 = 540 meters per week.]
Llama-3-8B-Instruct (Yoruba):
Let's break down the question step by step:
- James has 3 pinnu láti (whatever that means) at a time.
- He gives out 60 mítà (whatever that means) in total.
- The question asks how many mítà he will give out in total.
From the information given, we can infer that James gives out 3 mítà at a time. To find the total number of times he gives out mítà, we can divide the total number of mítà (60) by the number of mítà he gives out at a time (3):
60 ÷ 3 = 20
So, James gives out mítà 20 times. [...]
Therefore, the answer is 60.
The contrast is stark. N-ATLaS responds entirely in Yoruba with correct reasoning and the right answer (540). Llama-3-8B-Instruct, on the other hand, admits it doesn't understand key terms ("whatever that means"), misinterprets the problem entirely, and arrives at the wrong answer (60). This is a complete failure of language understanding.
Question (Igbo): James na-ekpebi ịgba ụzọ ọsọ atọ ugboro atọ n'izu ụka. Ọ gbara mita iri isii n'ọsọ ọbụla. Mita ole ka ọ gbazuru n'ime izu ụka?
N-ATLaS (Igbo):
Nzọụkwụ 1: Kpebie ọnụ ọgụgụ ọsọ ọsọ James na-agba n'otu izu. James na-agba ọsọ 3 ugboro n'ụbọchị na ọ gbara 60 mita n'ọsọ ọ bụla. Ya mere, ọ na-agba 3 * 60 = 180 mita kwa ụbọchị.
Nzọụkwụ 2: Gbakọọ ngụkọta anya ọ gbara n'ime izu ụka. Enwere ụbọchị asaa n'izu. Ọ bụrụ na ọ gbara mita 180 kwa ụbọchị, mgbe ahụ ọ gbara 180 * 7 = 1260 mita n'ime izu ụka.
[Translation: Step 1: Decide the number of sprints James runs in one week. James runs 3 sprints per day and he runs 60 meters each sprint. Therefore, he runs 3 * 60 = 180 meters per day. Step 2: Calculate the total distance he ran in the week. There are seven days in a week. If he ran 180 meters per day, then he ran 180 * 7 = 1260 meters in the week.]
Llama-3-8B-Instruct (Igbo):
A Nigerian language question!
Let's break it down step by step:
- "James na-ekpebi ịgba ụzọ ọsọ atọ ugboro atọ n'izu ụka" means "James has decided to run three times around the farm three times".
- "Ọ gbara mita iri isii n'ọsọ ọbụla" means "He ran 16 miles in the last race".
- The question is asking how many miles James ran in total.
To find the answer, we need to calculate the total distance James ran:
- He ran 3 times around the farm, which means he ran 3 x ? miles (we don't know the distance around the farm yet).
- He ran 16 miles in the last race.
Let's assume the distance around the farm is x miles. Then, the total distance James ran is:
3x + 16 miles [...]
N-ATLaS provides a complete, fluent Igbo response with step-by-step reasoning—though it makes a calculation error by computing daily distance then multiplying by 7 days, arriving at 1260 instead of 540. Still, the reasoning structure is sound and the language is natural. Llama-3-8B-Instruct completely misunderstands the problem, inventing details about "running around a farm" and "16 miles in the last race" that don't exist in the original question. It never arrives at an answer.
Question (Hausa): James ya yanke shawarar yin gudu sau 3 a mako. Yana gudun mita 60 kowane yi. Jimilla mita nawa yake gudu a mako daya?
N-ATLaS (Hausa):
Mataki 1: Ƙayyade adadin lokutan James yana gudu a mako. James yana gudu sau 3 a mako.
Mataki na 2: Ƙayyade adadin mita James yana gudu kowace rana. James yana gudu mita 60 kowace rana.
Mataki na 3: Ƙididdige jimlar mita James yana gudu a mako. Mita 60/rana * 3 days = mita 180
Amsa: James yana gudu mita 180 a mako.
[Translation: Step 1: Determine the number of times James runs per week. James runs 3 times per week. Step 2: Determine the number of meters James runs each time. James runs 60 meters each time. Step 3: Calculate the total meters James runs per week. 60 meters/time * 3 times = 180 meters. Answer: James runs 180 meters per week.]
Llama-3-8B-Instruct (Hausa):
Let's break down the question step by step:
- James ya yanke shawarar yin gudu sau 3 a mako: This means James has decided to run 3 miles.
- Yana gudun mita 60 kowane yi: This means each mile is 60 kilometers.
- Jimilla mita nawa yake gudu a mako daya: This means how many kilometers will James run in one mile.
To find the answer, we need to convert miles to kilometers. Since each mile is 60 kilometers, we can multiply the number of miles by 60:
3 miles × 60 kilometers/mile = 180 kilometers
So, James will run 180 kilometers in one mile.
N-ATLaS provides clear Hausa reasoning but makes an error similar to the Igbo case: it calculates 60 * 3 = 180 instead of 60 * 9 = 540, missing that he runs 3 sprints 3 times per week. Llama-3-8B-Instruct, meanwhile, invents a bizarre conversion between miles and kilometers that doesn't exist in the original problem and arrives at a nonsensical conclusion ("180 kilometers in one mile").
What This Reveals
These examples show something the aggregate scores hint at but don't fully capture: Llama-3-8B-Instruct fundamentally struggles to understand the Nigerian languages. It's not just getting math wrong, it's failing to parse the questions correctly. N-ATLaS, despite making arithmetic errors, consistently demonstrates actual comprehension: it understands what's being asked, structures reasoning appropriately, and responds in fluent, natural language.
This is the difference between a model that's been exposed to these languages during pretraining but never learned to truly understand them, and one that's been fine-tuned with sufficient Nigerian language data to develop real capability. The quantitative gains (20.0% → 31.2% for Yoruba, 21.0% → 24.6% for Igbo, 25.1% → 34.2% for Hausa) are impressive, but these qualitative differences, comprehension versus confusion, matter more for real-world use.
Machine Translation (Flores, chrF)
A. English → African (en→xx)
| Language | LLaMA-3-8B (no chat) | LLaMA-3-8B-Instruct (0-shot, chat) | LLaMA-3-8B-Instruct (5-shot*, chat) | N-ATLaS (0-shot, no chat) | N-ATLaS (0-shot, chat) | N-ATLaS (5-shot*, chat) |
|---|---|---|---|---|---|---|
| Yoruba | 6.2 | 12.3 | 13.9 | 16.2 | 27.2 | 26.5 |
| Igbo | 10.1 | 17.5 | 20.2 | 28.5 | 40.6 | 41.9 |
| Hausa | 17.1 | 26.0 | 29.1 | 32.0 | 49.6 | 50.1 |
Table 8. Machine Translation (FLORES) chrF scores for English-to-African language directions. N-ATLaS achieves 2-4x improvements over Llama-3-8B-Instruct baseline.
Figure 9. Machine Translation chrF scores (English→African languages) across model configurations.
B. African → English (xx→en)
| Language | LLaMA-3-8B (no chat) | LLaMA-3-8B-Instruct (0-shot, chat) | LLaMA-3-8B-Instruct (5-shot*, chat) | N-ATLaS (0-shot, no chat) | N-ATLaS (0-shot, chat) | N-ATLaS (5-shot*, chat) |
|---|---|---|---|---|---|---|
| Yoruba | 19.0 | 22.9 | 23.7 | 23.4 | 38.9 | 41.3 |
| Igbo | 22.1 | 26.6 | 30.7 | 22.8 | 45.4 | 48.1 |
| Hausa | 23.8 | 31.2 | 32.4 | 25.5 | 51.0 | 52.2 |
Table 9. Machine Translation (FLORES) chrF scores for African-to-English directions. N-ATLaS demonstrates strong improvements, with Hausa reaching 52.2 chrF in few-shot settings.
Figure 10. Machine Translation chrF scores (African Languages→English) across model configurations.
Dramatic Gains in Both Directions
Let's start with English→African. Llama-3-8B base is terrible at this: chrF scores of 6.2 (Yoruba), 10.1 (Igbo), 17.1 (Hausa). That means the model barely produces anything resembling the target language. N-ATLaS without chat templates already makes huge strides: 16.2 (Yoruba), 28.5 (Igbo), 32.0 (Hausa)—roughly 2-3x improvements.
Turn on chat templates and it gets even better: 27.2 (Yoruba), 40.6 (Igbo), 49.6 (Hausa). Few-shot gives small additional gains for Igbo and Hausa but slightly hurts Yoruba (26.5 vs 27.2 zero-shot—possibly noise or overfitting to example style).
The reverse direction (African→English) shows similar patterns. Llama-3-8B-Instruct maxes out around 23-32 chrF across languages. N-ATLaS few-shot reaches 41.3 (Yoruba), 48.1 (Igbo), 52.2 (Hausa)—improvements of 15-20 chrF points.
Why Translation Shows the Biggest Wins
Translation is perhaps the most direct test of bilingual capability. To translate well, you need:
- Strong understanding of the source language
- Strong generation capability in the target language
- Alignment between concepts across languages
Fine-tuning on parallel or comparable data in these language pairs directly improves all three. The 392 million token corpus likely included substantial translation examples or at least bilingual text, and that's exactly what the model needed.
The Yoruba Puzzle
One interesting pattern: Yoruba consistently lags behind Igbo and Hausa in translation scores. At few-shot en→xx, Yoruba is at 26.5 while Hausa is at 50.1—nearly double. At xx→en, Yoruba is at 41.3 while Hausa is at 52.2.
This could be due to several factors:
- Data quality/quantity: Maybe the fine-tuning corpus had less Yoruba translation data or lower-quality Yoruba text
- Linguistic distance: Yoruba's tonal system and different morphology might make it harder to learn from limited data
- Evaluation artifacts: chrF is character-based, and orthographic differences between languages might affect scores differently
Still, even at 26.5-41.3 chrF, Yoruba is vastly improved over base Llama-3-8B's 6.2-19.0. It's just that Hausa and Igbo improved even more.
Translation in Practice: What the Models Actually Produce
chrF scores are useful, but they don't tell the full story of translation quality. Let's look at actual translations of the same sentence across all language pairs, comparing Llama-3-8B-Instruct (zero-shot, chat) with N-ATLaS (zero-shot, chat):
Source (English): Danius said, "Right now we are doing nothing. I have called and sent emails to his closest collaborator and received very friendly replies. For now, that is certainly enough."
N-ATLaS (English → Yoruba):
Danius sọ pé, "Lọwọlọwọ, a kò ṣe ohunkóhun. Mo ti pe ati fi imeeli ranṣẹ si ẹni ti o sunmọ ọdọ rẹ, mo si gba awọn idahun ọrẹ pupọ. Fun bayi, iyẹn to."
Llama-3-8B-Instruct (English → Yoruba):
"English translation: "Danius ti wo, "Ojoo, won ni mo fe. Mo ti pe ebi ati emi fun oni omo re ati mo fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si, omo re ti fe lo si"
N-ATLaS produces coherent Yoruba that, while not perfect, maintains the general structure and meaning of the original. Llama-3-8B-Instruct, on the other hand, produces nonsensical repetition, getting stuck in a loop of "omo re ti fe lo si" (a phrase that doesn't even match the source content). This is a complete generation failure.
N-ATLaS (English → Igbo):
Danius kwuru, "Ugbu a, anyị na-eme ihe ọ bụla. Ekpere m oku na izipu ozi-e na onye ya na ya na-arụkọ ọrụ nso, ha zara n'ezie n'ụzọ enyi. Nke ahụ bụ ihe zuru oke ugbu a."
Llama-3-8B-Instruct (English → Igbo):
Otu nke Danius kọ, "Nke a na-ahụ anyị n'ime. M na-akwụọ na-akwụọ email ka onye ọkachamara ya na-ahụ anyị na-ahụ anyị. Nke a bụ nke dị iche iche."
N-ATLaS again provides a mostly coherent Igbo translation that captures key elements of the original. Llama-3-8B-Instruct produces something that vaguely resembles Igbo but is grammatically awkward and loses much of the original meaning—particularly the nuance about "friendly replies" and "for now, that is certainly enough."
N-ATLaS (English → Hausa):
Danius ya ce, "A yanzu ba mu yi komai ba. Na kira kuma na aika imel ga abokin aikinsa mafi kusa kuma na samu amsoshi masu kyau. Wannan tabbas ya isa a yanzu."
Llama-3-8B-Instruct (English → Hausa):
Dan ya ce, "A yau, muka yi karo ba. Nawa kuma ni kai sako da mace gaba da shi da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da aka yi masa tambaya da
N-ATLaS, again, conveys the main message in Hausa, though the translation could be more natural. Llama-3-8B-Instruct again falls into a repetitive loop, this time with "aka yi masa tambaya" (which relates to questioning), showing complete generation breakdown.
Now for the reverse direction—translating from Nigerian languages back to English:
Source (Yoruba): Daniusi wipe, "Nisin a o se nkankan. Mo ti meeli ranse mo de ti pe si awon ajo sise timo timo re won de fun mi lesi to loyaya. Fun igba yi, o daniloju pe iyen ti to."
N-ATLaS (Yoruba → English):
He said, "Now, we do nothing. I sent a letter to the relevant authorities and they confirmed that I am not registered. For this time, it is enough."
Llama-3-8B-Instruct (Yoruba → English):
Here's the translation:
Yoruba: "Daniusi wipe, 'Nisin a o se nkankan. Mo ti meeli ranse mo de ti pe si awon ajo sise timo timo re won de fun mi lesi to loyaya. Fun igba yi, o daniloju pe iyen ti to.'" English: "Listen carefully, 'I will not do anything foolish. I have already taken care of my affairs, and I have done what I wanted to do with my friends, and they are now doing it for me. For this time, I will not be deceived by anyone's trickery.'"
Note: The Yoruba language is a complex and nuanced language, and the translation provided is one possible interpretation. The original text may have multiple layers of meaning and cultural references that are specific to the Yoruba culture.
N-ATLaS provides a concise, coherent translation. It's not perfect (it misses some nuances of the original), but it captures the core meaning. Llama-3-8B-Instruct, meanwhile, completely fabricates content about "not doing anything foolish" and "trickery" that doesn't exist in the source. The added disclaimer about "complex and nuanced language" reads like the model knows it's uncertain but proceeds anyway with a hallucinated translation.
Source (Igbo): Danius si, "Ugbu a anyị anaghị eme ihe ọbụla. Akpọọlam oku ma zijere onye ya na ya kacha arụkọ ozi ịntaneetị ma nwetekwa ụsa nnabata. Ugbua nke ahụ zuru oke."
N-ATLaS (Igbo → English):
English: Danius says, "Now we don't do anything. Call him up and meet with his online contact and get an acceptance letter. That's all that's left."
Llama-3-8B-Instruct (Igbo → English):
"English: "Danius said, 'We don't do anything wrong. We call a meeting and invite the one who knows the most about the matter to come and explain it to us, and we accept what he says. That's the way it is.'"
N-ATLaS again provides a reasonable translation, though it slightly misinterprets some details (like "acceptance letter"). Llama-3-8B-Instruct invents content about "doing nothing wrong" and "calling a meeting" that doesn't match the source—it's producing plausible-sounding English but has lost fidelity to the original Igbo.
Source (Hausa): Danius ya ce, "Yanzu haka ba mwa yin komi. Na kira na kuma aika emails zuwa abokan aikinsa na kusa na kuma anshi amsoshi masu kyau. A yanzu haka, wannan bai isa ba."
N-ATLaS (Hausa → English):
Daniel said, "Now, we are not making any more excuses. I have called and sent emails to my close colleagues and received positive responses. Now, this is not enough."
Llama-3-8B-Instruct (Hausa → English):
"English: "Danius said, 'Now, I don't have time for small talk. I'm going to send emails to my colleagues nearby and respond to good messages. At the moment, this won't work.'"
Note: Hausa is a language spoken in West Africa, particularly in Nigeria and other countries. The translation provided is in English, and it's an attempt to convey the meaning of the Hausa sentence.
N-ATLaS captures the core message accurately, preserving the key details about calls, emails, colleagues, and positive responses. Llama-3-8B-Instruct invents content about "small talk" and "this won't work" that doesn't appear in the source, and the disclaimer about Hausa being "spoken in West Africa" feels like the model padding its response when uncertain.
What These Examples Reveal
The pattern is clear: Llama-3-8B-Instruct either produces repetitive gibberish (especially en→xx) or fabricates content that sounds plausible but doesn't match the source (especially xx→en). N-ATLaS, while not perfect, consistently produces translations that preserve core meaning and demonstrate actual understanding of both source and target languages.
This qualitative difference matters far more than chrF scores suggest. A chrF of 27 vs 12 for Yoruba might sound incremental, but it's the difference between usable translation and complete failure. For real applications—whether translating government documents, educational materials, or health information—N-ATLaS crosses a threshold where Llama-3-8B-Instruct simply doesn't function.
A Note on Evaluation: chrF vs LLM-as-Judge
We should acknowledge that chrF, while standard, has limitations. It's a character n-gram metric—essentially measuring surface-level overlap between the model's output and reference translations. But translation isn't just about matching characters; it's about conveying meaning, maintaining style, and sounding natural.
Two translations could have different chrF scores but be equally good (or bad). A translation that uses different but equivalent words would score lower than one that matches the reference exactly, even if both are correct. Conversely, a translation that matches characters but misses the meaning would score well despite being wrong.
The examples above illustrate this perfectly: Llama-3-8B-Instruct's hallucinated translations might occasionally match reference text by coincidence, while N-ATLaS's semantically accurate translations using different phrasing would score lower. LLM-as-judge evaluation, where a strong model rates translations on dimensions like accuracy, fluency, and adequacy, would give us a richer picture.
We're considering running that analysis on a sample of outputs—comparing Llama-3-8B-Instruct zero-shot vs N-ATLaS zero-shot—to complement these quantitative scores. But that's future work. For now, chrF combined with qualitative examples tells us N-ATLaS is substantially better, even if the metric can't capture all the nuances.
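To make that limitation concrete, here is a tiny sketch using the sacrebleu implementation of chrF; the sentences are invented for illustration, but they show how a faithful paraphrase can score well below a near-verbatim match:

```python
# Small illustration of chrF's surface-level nature using sacrebleu.
# The reference/hypothesis sentences are made up for illustration.
from sacrebleu.metrics import CHRF

chrf = CHRF()
reference = ["Danius said that for now they are doing nothing."]

near_verbatim = "Danius said that for now they are doing nothing at all."
paraphrase = "According to Danius, nothing is being done at the moment."

print(chrf.sentence_score(near_verbatim, reference).score)  # high overlap -> high chrF
print(chrf.sentence_score(paraphrase, reference).score)     # same meaning -> lower chrF
```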
N-ATLaS Evaluation Huggingface Space
We've covered aggregate metrics and selected examples, but there's much more to explore. We've built an interactive Hugging Face Space where you can engage directly with both models, examine the full evaluation data, and vote on which model performs better.
The space has three main tabs:
Chat: Interact with N-ATLaS-LLM or Llama-3-8B-Instruct in multi-turn conversations. Ask questions in any of the supported languages and see how each model responds.
Translate: Get side-by-side translations between English and Yoruba/Igbo/Hausa from both models. Enter your own text, compare the outputs, and vote on which translation is better.
Evaluate: Browse pre-computed model outputs for Machine Translation and Mathematical Reasoning tasks. Compare responses side-by-side and vote on which model performs better. This is where you'll find many more examples like the ones shown in this report.
If you speak Yoruba, Igbo, or Hausa, your votes are especially valuable: you can assess translation quality and fluency in ways that automated metrics like chrF cannot capture.
Note: The N-ATLaS model endpoint may be paused intermittently due to cost constraints.
Visit the space: https://huggingface.co/spaces/seun-ajayi/n-atlas-evaluations
Conclusion
Did N-ATLaS move the needle? Yes: significantly in translation and math, modestly in reasoning, and substantially in few-shot learning.
Comparing instruction-tuned models (the fairest baseline), N-ATLaS shows 40-57% relative improvements on Nigerian languages at zero-shot, with few-shot prompting adding another 20-30% gains. It transformed translation from broken to usable, made math reasoning viable in native languages, and demonstrated better in-context learning than its base model.
But we're not at language parity. English still outperforms the Nigerian languages by 24-32% on average, with gaps of up to 40 points on knowledge-heavy tasks. Closing them will require sustained effort: more data, better evaluation, and continued development.
Still, N-ATLaS is a proof point. It shows that targeted fine-tuning, even on modest compute budgets, makes real progress toward bringing AI to historically underserved languages. That's worth celebrating while acknowledging how far we still need to go.
For AI Researchers building models for African languages: This evaluation demonstrates the value of standardized, reproducible benchmarking. We encourage you to use AfroBench or its lightweight variant AfroBench-LITE to evaluate your models. Consistent evaluation frameworks let us compare progress across projects, identify what works, and build on each other's successes rather than starting from scratch each time.
For Nigerian language NLP practitioners: N-ATLaS isn't the finish line. It's a solid foundation. Use it, build on it, and let's keep moving that needle.