Model Card for ota-mdeberta-v3-base
This is a masked-language (fill-mask) model for classical Ottoman Turkish, fine-tuned from the microsoft/mdeberta-v3-base checkpoint. It was trained on a corpus of 38,732,566 tokens drawn from 144 literary works in poetry and prose composed between the 15th and 20th centuries.
Model Details
Unlike FacebookAI/xlm-roberta-base, the tokenizer of microsoft/mdeberta-v3-base recognizes characters in the IJMES transliteration alphabet, such as Ḥ and ẓ.
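As a quick sanity check, the snippet below (a minimal sketch; the example phrase is illustrative) tokenizes an IJMES-transliterated string and confirms that the special characters do not fall back to the unknown token:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("enesyila/ota-mdeberta-v3-base")

# IJMES characters such as Ḥ and ẓ should be tokenized rather than
# mapped to the unknown token (the phrase itself is illustrative).
tokens = tok.tokenize("Ḥaḳīḳat ẓāhir oldı")
print(tokens)
assert tok.unk_token not in tokens
```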
Model Description
- Developed by: Enes Yılandiloğlu
- Shared by: Enes Yılandiloğlu
- Model type: fill-mask
- Language(s) (NLP): Ottoman Turkish (1500-1928)
- License: cc-by-nc-4.0
- Finetuned from model: microsoft/mdeberta-v3-base
Uses
Direct Use
Mask filling & completion of Ottoman Turkish sentences
Downstream Use
- Named Entity Recognition (see the fine-tuning sketch after this list)
- UD-style annotation
- Translation
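For example, the checkpoint can be loaded as a token-classification backbone for NER or UD-style tagging. The sketch below is hedged: the label set is hypothetical, and fine-tuning on annotated data is still required.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "enesyila/ota-mdeberta-v3-base"

# Hypothetical NER label set, for illustration only.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The classification head is freshly initialized; fine-tune with the
# Trainer API on task-specific annotated data before use.
```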
Bias, Risks, and Limitations
- Potential to reproduce offensive content
The training data originates from digitized Ottoman texts, which rarely include offensive language: such phrases were usually censored by the scholars who digitized the texts, often by replacing at least one letter of the phrase with a dot. Even so, the model may generate or complete text containing outdated slurs, sectarian insults, or derogatory language that appears in the original manuscripts.
- Cultural and historical bias
Since the data reflects the norms and viewpoints of past eras, gender, ethnic, or religious biases present in the source material can be mirrored in generated outputs.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# 1. Load the fine-tuned model & tokenizer from the Hub
model_name = "enesyila/ota-mdeberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# 2. Create a mask-filling pipeline
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# 3. Run it on an Ottoman-style sentence
sequence = "Ne yanar kimse bana âteş-i [MASK] özge"
results = unmasker(sequence)

# 4. Print the top 5 predictions
for r in results:
    print(f"{r['sequence']} (score: {r['score']:.4f})")
```
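Each result is a dict with `sequence`, `score`, `token`, and `token_str` keys. The fill-mask pipeline returns the five highest-scoring candidates by default; pass `top_k=` when calling it to change that.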
Training Details
Training Data
The training data consists of 144 Ottoman Turkish works written between the 15th and 20th centuries. The dataset will be released soon.
Preprocessing
Footnotes and page numbers added by the editors were removed using font-size-based heuristics and regex rules.
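The exact rules are not published; the sketch below is a hypothetical reconstruction of the regex pass (the font-size filtering happens earlier, at text-extraction time). Both patterns are illustrative assumptions:

```python
import re

# Hypothetical cleanup patterns: standalone page numbers on their own
# line, and bracketed footnote markers like "[12]".
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
FOOTNOTE_MARK = re.compile(r"\[\d+\]")

def clean_line(line: str) -> str:
    if PAGE_NUMBER.match(line):
        return ""  # drop bare page numbers entirely
    return FOOTNOTE_MARK.sub("", line)
```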
Training Hyperparameters
- Training regime: FP16 mixed precision (enabled via `fp16=True`), with PyTorch 2.0's `torch.compile` for JIT optimization and `gradient_checkpointing=True` to reduce activation memory
- Chunk size: 128 tokens
- Batching:
  - `per_device_train_batch_size=32`
  - `per_device_eval_batch_size=32`
- Optimizer & schedule:
  - Optimizer: AdamW
  - Learning rate: 2 × 10⁻⁵
  - LR scheduler: linear
  - Weight decay: 0.01
  - Warmup ratio: 0.06
- Training schedule:
  - Number of epochs: 5
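These settings map onto Hugging Face `TrainingArguments` roughly as follows (a sketch: `output_dir` and the optimizer flag are assumptions, and the 128-token chunking is done during dataset preparation rather than here):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ota-mdeberta-v3-base",   # hypothetical
    fp16=True,                           # FP16 mixed precision
    torch_compile=True,                  # PyTorch 2.0 torch.compile
    gradient_checkpointing=True,         # reduce activation memory
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    optim="adamw_torch",                 # AdamW (assumed variant)
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    warmup_ratio=0.06,
    num_train_epochs=5,
)
```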
Performance
| Epoch | Training Loss | Validation Loss | Perplexity (Val) |
|---|---|---|---|
| 1 | 3.8525 | 2.0914 | 8.09 |
| 2 | 2.0859 | 1.7165 | 5.56 |
| 3 | 1.8223 | 1.5849 | 4.88 |
| 4 | 1.7002 | 1.4949 | 4.46 |
| 5 | 1.6427 | 1.4765 | 4.38 |
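The perplexity column is the exponential of the validation loss:

```python
import math

# e.g. the epoch-5 row: exp(1.4765) ≈ 4.38
print(math.exp(1.4765))
```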
Model Card Authors
Enes Yılandiloğlu
Model Card Contact