# Yoruba Sentence Boundary Detection Model
A BERT-based token classification model for sentence boundary detection in Yoruba text. This model identifies where sentences begin and end in continuous Yoruba text — a foundational step for tokenization, translation, summarization, and other NLP pipelines.
📄 Dataset: abnuel/yor_punctuation (1M–10M tokens)
🔗 Related: abnuel/yoruba_sent_boundary_2 — improved iteration
## Model Description
Sentence boundary detection (SBD) in Yoruba presents unique challenges: the language uses tonal diacritics, has complex morphology, and real-world Yoruba text is frequently unpunctuated or inconsistently punctuated. Standard rule-based SBD approaches designed for English fail to capture these linguistic patterns.
This model takes a sequence-labeling approach: each token is tagged as either a sentence boundary or a continuation token. It was fine-tuned on a large Yoruba corpus.
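As an illustration of the sequence-labeling formulation, here is what a tagged token sequence looks like. The token/label alignment below is hypothetical and hand-written for illustration; real labels come from the model's predictions (see the Labels section for the tag set):

```python
# Hypothetical alignment: B-SENT marks the first token of each sentence,
# O marks continuation tokens.
tokens = ["Mo", "jókòó", "sí", "ilé", "Èmi", "yóò", "padà", "wá"]
labels = ["B-SENT", "O", "O", "O", "B-SENT", "O", "O", "O"]

for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```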
- Base model: Davlan/bert-base-multilingual-cased-finetuned-yoruba
- Task: Token classification (sentence boundary detection)
- Language: Yoruba (`yo`)
- Parameters: 177.3M
- Architecture: BERT
## Labels
| Label | Description |
|---|---|
| `O` | Not a sentence boundary |
| `B-SENT` | Sentence boundary (start of a new sentence) |
(Check config.json for the exact id2label mapping.)
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "abnuel/yoruba_sent_boundary"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer)

# Example: continuous Yoruba text without explicit sentence markers
text = "Mo jókòó sí ilé Èmi yóò padà wá ní àárọ ọjọ́ kejì àwọn ará ilé mi dúpẹ́"
result = nlp(text)
print(result)
```
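The pipeline returns per-token predictions with character offsets, so sentence strings can be recovered by slicing the text at predicted boundaries. A minimal sketch, assuming `B-SENT` marks the first token of each sentence; the `sample_preds` list below is hypothetical, hard-coded in place of real pipeline output:

```python
# Hypothetical pipeline output for the text below (character offsets;
# real output depends on the model's tokenizer and predictions).
sample_preds = [
    {"entity": "B-SENT", "start": 0, "end": 2, "word": "Mo"},
    {"entity": "B-SENT", "start": 16, "end": 19, "word": "Èmi"},
]

def split_sentences(text, predictions, boundary_label="B-SENT"):
    """Slice `text` into sentences at predicted boundary starts."""
    starts = sorted({p["start"] for p in predictions if p["entity"] == boundary_label})
    if not starts or starts[0] != 0:
        starts = [0] + starts  # always keep the text's first character as a start
    spans = zip(starts, starts[1:] + [len(text)])
    return [text[a:b].strip() for a, b in spans]

text = "Mo jókòó sí ilé Èmi yóò padà wá"
print(split_sentences(text, sample_preds))
# -> ['Mo jókòó sí ilé', 'Èmi yóò padà wá']
```

With `aggregation_strategy` set on the pipeline, the output key becomes `entity_group` instead of `entity`; adjust accordingly.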
## Training Details
- Fine-tuning approach: Token classification head on top of Yoruba-adapted multilingual BERT
- Dataset: abnuel/yor_punctuation
- Dataset size: 1M–10M tokens
- Reference: Punctuation Restoration using Transformer Models
## Model Iterations
This is the first version. For improved performance, see:
- abnuel/yoruba_sent_boundary_2 — refined training, improved boundary precision
- abnuel/yoruba_sent_boundary_3 — latest iteration
## Limitations
- Best suited for standard written Yoruba; performance may degrade on heavily code-switched or dialectal text.
- Tonal diacritics (e.g., àáâ) should be present for optimal results; the model was not specifically evaluated on diacritic-stripped text.
- As a pioneering tool for Yoruba SBD, evaluation benchmarks are limited — community evaluation and contributions are welcome.
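To gauge sensitivity to missing diacritics, one option is to strip combining marks from input text and compare predictions against the diacritized original. A minimal sketch using only Python's standard library; note that this removes the dot-below of ọ/ẹ/ṣ as well as tone marks, which changes letter identity, not just tone:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD), drop all combining marks, then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("Mo jókòó sí ilé"))  # -> Mo jokoo si ile
```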
## Why This Matters
Yoruba is spoken by ~45 million people but has minimal NLP infrastructure compared to European languages. Sentence boundary detection is foundational — without it, downstream tasks like machine translation, summarization, and speech-to-text post-processing are significantly impaired. This model is part of a broader effort to build the NLP toolchain for Yoruba and other low-resource African languages.
## Related Models & Resources
- abnuel/yoruba_task1_punctuation_model — Punctuation restoration for Yoruba
- abnuel/yoruba_punctuation_2 — Updated punctuation model
- abnuel/yor_punctuation — Shared training dataset
## Citation
```bibtex
@misc{adegunlehin2025yoruba-sbd,
  author = {Abayomi Adegunlehin},
  title  = {Yoruba Sentence Boundary Detection Model},
  year   = {2025},
  url    = {https://huggingface.co/abnuel/yoruba_sent_boundary}
}
```