Model Card for ota-mdeberta-v3-base
This is a masked-language (fill-mask) model for classical Ottoman Turkish, fine-tuned from the microsoft/mdeberta-v3-base checkpoint. It was trained on a corpus of 38,732,566 tokens drawn from 144 literary works in poetry and prose composed between the 15th and 20th centuries.
Model Details
Unlike FacebookAI/xlm-roberta-base, the tokenizer of microsoft/mdeberta-v3-base recognizes characters in the IJMES transliteration alphabet, such as Ḥ and ẓ.
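As a quick sanity check, the snippet below (a minimal sketch; the example phrase is illustrative) tokenizes an IJMES-transliterated string and confirms that the special characters do not fall back to the unknown token:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("enesyila/ota-mdeberta-v3-base")

# IJMES characters such as Ḥ and ẓ should be tokenized rather than
# mapped to the unknown token (the phrase itself is illustrative).
tokens = tok.tokenize("Ḥaḳīḳat ẓāhir oldı")
print(tokens)
assert tok.unk_token not in tokens
```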
Model Description
- Developed by: Enes Yılandiloğlu
- Shared by: Enes Yılandiloğlu
- Model type: fill-mask
- Language(s) (NLP): Ottoman Turkish (1500-1928)
- License: cc-by-nc-4.0
- Finetuned from model: microsoft/mdeberta-v3-base
Uses
Direct Use
Mask filling & completion of Ottoman Turkish sentences
Downstream Use
- Named Entity Recognition (see the fine-tuning sketch after this list)
- UD-style annotation
- Translation
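For example, the checkpoint can be loaded as a token-classification backbone for NER or UD-style tagging. The sketch below is hedged: the label set is hypothetical, and fine-tuning on annotated data is still required.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "enesyila/ota-mdeberta-v3-base"

# Hypothetical NER label set, for illustration only.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is freshly initialized; fine-tune with the
# Trainer API on task-specific annotated data before use.
```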
Bias, Risks, and Limitations
- Potential to reproduce offensive content
The training data originates from digitized Ottoman texts, which rarely include offensive language: such phrases were usually censored by the scholars who digitized the texts, often by replacing at least one letter of the phrase with a dot. Even so, the model may generate or complete text containing outdated slurs, sectarian insults, or derogatory language that appears in the original manuscripts.
- Cultural and historical bias
Since the data reflects the norms and viewpoints of past eras, gender, ethnic, or religious biases present in the source material can be mirrored in generated outputs.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# 1. Load the fine-tuned model & tokenizer from the Hub
model_name = "enesyila/ota-mdeberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# 2. Create a mask-filling pipeline
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# 3. Run it on an Ottoman-style sentence
sequence = "Ne yanar kimse bana âteş-i [MASK] özge"
results = unmasker(sequence)

# 4. Print the top 5 predictions
for r in results:
    print(f"{r['sequence']} (score: {r['score']:.4f})")
```
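Each result is a dict with `sequence`, `score`, `token`, and `token_str` keys. The fill-mask pipeline returns the five highest-scoring candidates by default; pass `top_k=` when calling it to change that.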
Training Details
Training Data
The training data consists of 144 Ottoman Turkish works written between the 15th and 20th centuries. The dataset will be released soon.
Preprocessing
Footnotes and page numbers added by the editors were removed using font-size-based heuristics and regex rules.
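The exact rules are not published; the sketch below is a hypothetical reconstruction of the regex pass (the font-size filtering happens earlier, at text-extraction time). Both patterns are illustrative assumptions:

```python
import re

# Hypothetical cleanup patterns: standalone page numbers on their own
# line, and bracketed footnote markers like "[12]".
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
FOOTNOTE_MARK = re.compile(r"\[\d+\]")

def clean_line(line: str) -> str:
    if PAGE_NUMBER.match(line):
        return ""  # drop bare page numbers entirely
    return FOOTNOTE_MARK.sub("", line)
```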
Training Hyperparameters
- Training regime: FP16 mixed precision (enabled via `fp16=True`), with PyTorch 2.0's `torch.compile` for JIT optimization and `gradient_checkpointing=True` to reduce activation memory
- Chunk size: 128 tokens
- Batching:
  - `per_device_train_batch_size=32`
  - `per_device_eval_batch_size=32`
- Optimizer & schedule:
  - Optimizer: AdamW
  - Learning rate: 2 × 10⁻⁵
  - LR scheduler: linear
  - Weight decay: 0.01
  - Warmup ratio: 0.06
- Training schedule:
  - Number of epochs: 5
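These settings map onto Hugging Face `TrainingArguments` roughly as follows (a sketch: `output_dir` and the optimizer flag are assumptions, and the 128-token chunking is done during dataset preparation rather than here):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ota-mdeberta-v3-base",   # hypothetical
    fp16=True,                           # FP16 mixed precision
    torch_compile=True,                  # PyTorch 2.0 torch.compile
    gradient_checkpointing=True,         # reduce activation memory
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    optim="adamw_torch",                 # AdamW (assumed variant)
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    warmup_ratio=0.06,
    num_train_epochs=5,
)
```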
Performance
| Epoch | Training Loss | Validation Loss | Perplexity (Val) |
|---|---|---|---|
| 1 | 3.8525 | 2.0914 | 8.09 |
| 2 | 2.0859 | 1.7165 | 5.56 |
| 3 | 1.8223 | 1.5849 | 4.88 |
| 4 | 1.7002 | 1.4949 | 4.46 |
| 5 | 1.6427 | 1.4765 | 4.38 |
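The perplexity column is the exponential of the validation loss:

```python
import math

# e.g. the epoch-5 row: exp(1.4765) ≈ 4.38
print(math.exp(1.4765))
```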
Model Card Authors
Enes Yılandiloğlu
Model Card Contact