Yoruba Sentence Boundary Detection Model

A BERT-based token classification model for sentence boundary detection in Yoruba text. This model identifies where sentences begin and end in continuous Yoruba text — a foundational step for tokenization, translation, summarization, and other NLP pipelines.

📄 Dataset: abnuel/yor_punctuation (1M–10M tokens) 🔗 Related: abnuel/yoruba_sent_boundary_2 — improved iteration

Model Description

Sentence boundary detection (SBD) in Yoruba presents unique challenges: the language uses tonal diacritics, has complex morphology, and real-world Yoruba text is frequently unpunctuated or inconsistently punctuated. Standard rule-based SBD approaches designed for English fail to capture these linguistic patterns.

This model takes a sequence-labeling approach: each token is tagged as either a sentence boundary or a continuation token. It was fine-tuned on a large Yoruba corpus.

  • Base model: Davlan/bert-base-multilingual-cased-finetuned-yoruba
  • Task: Token classification (sentence boundary detection)
  • Language: Yoruba (yo)
  • Parameters: 177.3M
  • Architecture: BERT

Labels

Label    Description
O        Not a sentence boundary
B-SENT   First token of a new sentence (marks a boundary)

(Check config.json for the exact id2label mapping.)
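
Under this scheme each token receives exactly one tag, and a B-SENT tag marks the start of a new sentence. A toy illustration of the alignment (the segmentation and the label2id mapping below are assumptions for demonstration; verify both against config.json, not model output):

```python
# Illustrative alignment of tokens to boundary tags.
label2id = {"O": 0, "B-SENT": 1}  # assumed mapping; check config.json

tokens = ["Mo", "jókòó", "sí", "ilé", "Èmi", "yóò", "padà", "wá"]
labels = ["B-SENT", "O", "O", "O", "B-SENT", "O", "O", "O"]

# Each training example is effectively a sequence of (token, tag id) pairs:
example = [(t, label2id[l]) for t, l in zip(tokens, labels)]
print(example[:2])  # [('Mo', 1), ('jókòó', 0)]
```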

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "abnuel/yoruba_sent_boundary"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer)

# Example: continuous Yoruba text without explicit sentence markers
text = "Mo jókòó sí ilé Èmi yóò padà wá ní àárọ ọjọ́ kejì àwọn ará ilé mi dúpẹ́"
result = nlp(text)
print(result)
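
The pipeline returns one record per (sub)token, including character offsets into the input text. A sketch of turning those records into sentence strings, run here against mock predictions shaped like the pipeline output (the field names and offsets are assumptions to be checked against real output):

```python
def split_sentences(text, predictions):
    """Split `text` at each predicted B-SENT token, using character offsets.

    `predictions` is a list of dicts shaped like token-classification
    pipeline output: each has 'entity', 'start', and 'end' keys.
    """
    # Character positions where a new sentence begins (skip position 0).
    starts = sorted(p["start"] for p in predictions
                    if p["entity"] == "B-SENT" and p["start"] > 0)
    bounds = [0] + starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])
            if text[a:b].strip()]

# Mock predictions standing in for real pipeline output.
text = "Mo jókòó sí ilé Èmi yóò padà wá"
preds = [
    {"entity": "B-SENT", "start": 0, "end": 2},
    {"entity": "O", "start": 3, "end": 8},
    {"entity": "B-SENT", "start": 16, "end": 19},
]
print(split_sentences(text, preds))
# ['Mo jókòó sí ilé', 'Èmi yóò padà wá']
```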

Training Details

  • Fine-tuning approach: Token classification head on top of Yoruba-adapted multilingual BERT
  • Dataset: abnuel/yor_punctuation
  • Dataset size: 1M–10M tokens
  • Reference: punctuation restoration using Transformer models

Model Iterations

This is the first version. For improved performance, see abnuel/yoruba_sent_boundary_2.

Limitations

  • Best suited for standard written Yoruba; performance may degrade on heavily code-switched or dialectal text.
  • Tonal diacritics (e.g., àáâ) should be present for optimal results; the model was not specifically evaluated on diacritic-stripped text.
  • As a pioneering tool for Yoruba SBD, evaluation benchmarks are limited — community evaluation and contributions are welcome.
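
To gauge the diacritics sensitivity noted above, one can compare predictions on original versus diacritic-stripped text. A minimal stripping helper using Unicode NFD decomposition (note that this also removes the phonemic under-dots of ọ, ẹ, ṣ, so it is a blunt robustness probe, not a normalization step):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks and under-dots) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Mo jókòó sí ilé"))  # Mo jokoo si ile
```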

Why This Matters

Yoruba is spoken by ~45 million people but has minimal NLP infrastructure compared to European languages. Sentence boundary detection is foundational — without it, downstream tasks like machine translation, summarization, and speech-to-text post-processing are significantly impaired. This model is part of a broader effort to build the NLP toolchain for Yoruba and other low-resource African languages.

Related Models & Resources

  • abnuel/yoruba_sent_boundary_2 — improved iteration of this model
  • abnuel/yor_punctuation — training dataset

Citation

@misc{adegunlehin2025yoruba-sbd,
  author = {Abayomi Adegunlehin},
  title  = {Yoruba Sentence Boundary Detection Model},
  year   = {2025},
  url    = {https://huggingface.co/abnuel/yoruba_sent_boundary}
}