Model Card for ota-mdeberta-v3-base

This is a masked-language (fill-mask) model for classical Ottoman Turkish, fine-tuned from the microsoft/mdeberta-v3-base checkpoint. It was trained on a corpus of 38,732,566 tokens drawn from 144 literary works of poetry and prose composed between the 15th and 20th centuries.

Model Details

Unlike FacebookAI/xlm-roberta-base, the tokenizer of microsoft/mdeberta-v3-base recognizes characters in the IJMES transliteration alphabet, such as Ḥ and ẓ.
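
A quick way to compare the two tokenizers on such characters (a minimal check; the example word is an arbitrary IJMES-style transliteration, and both checkpoints are assumed to be reachable on the Hub):

from transformers import AutoTokenizer

# Inspect how each tokenizer splits a word containing IJMES characters.
# A tokenizer that does not know a character will typically fall back to
# an unknown token rather than keeping the character intact.
for name in ["microsoft/mdeberta-v3-base", "FacebookAI/xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize("Ḥaḳḳ"))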

Model Description

  • Developed by: Enes Yılandiloğlu
  • Shared by: Enes Yılandiloğlu
  • Model type: fill-mask
  • Language(s) (NLP): Ottoman Turkish (1500-1928)
  • License: cc-by-nc-4.0
  • Finetuned from model: microsoft/mdeberta-v3-base

Uses

Direct Use

Mask filling & completion of Ottoman Turkish sentences

Downstream Use

  • Named Entity Recognition (a fine-tuning sketch follows this list)
  • UD-style annotation
  • Translation
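
For downstream tasks such as NER, the checkpoint can be loaded with a task-specific head. A minimal sketch, assuming a token-labelled Ottoman Turkish corpus; the label set below is hypothetical:

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical label set; replace with the labels of your own NER corpus.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

# Load the fill-mask checkpoint with a freshly initialized
# token-classification head sized to the label set.
model = AutoModelForTokenClassification.from_pretrained(
    "enesyila/ota-mdeberta-v3-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-mdeberta-v3-base")
# Fine-tune with the Trainer API on the labelled data as usual.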

Bias, Risks, and Limitations

  • Potential to reproduce offensive content
    The training data originates from digitized Ottoman texts, which rarely include offensive language. Such phrases were censored by the scholars who digitized the texts. This often means at least one letter in the phrase was replaced with a dot. Thus, the model may generate or complete text containing outdated slurs, sectarian insults, or derogatory language that appear in the original manuscripts.
  • Cultural and historical bias
    Since the data reflects the norms and viewpoints of past eras, gender, ethnic, or religious biases present in the source material can be mirrored in generated outputs.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# 1. Load your finetuned model & tokenizer from the Hub
model_name = "enesyila/ota-mdeberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model     = AutoModelForMaskedLM.from_pretrained(model_name)

# 2. Create a mask-filling pipeline
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# 3. Run it on an Ottoman-style sentence (mDeBERTa's mask token is [MASK])
sequence = "Ne yanar kimse bana âteş-i [MASK] özge"
results = unmasker(sequence)

# 4. Print the top 5 predictions
for r in results:
    print(f"{r['sequence']} (score: {r['score']:.4f})")

Training Details

Training Data

The training data consists of 144 Ottoman Turkish works written between the 15th and 20th centuries. The dataset will be released soon.

Preprocessing

Footnotes and page numbers added by editors were removed using font-size-based filtering and regular-expression rules.
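
A minimal sketch of what the regex side of this cleanup might look like; the patterns below are illustrative assumptions, as the card does not publish the actual rules, and the font-size-based footnote filtering happens earlier, at text-extraction time:

import re

# Assumed patterns: standalone digits on their own line (page numbers)
# and bracketed digits in running text (footnote markers).
PAGE_NUMBER  = re.compile(r"^\s*\d+\s*$", re.MULTILINE)
FOOTNOTE_REF = re.compile(r"\[\d+\]|\(\d+\)")

def clean(text: str) -> str:
    text = PAGE_NUMBER.sub("", text)
    return FOOTNOTE_REF.sub("", text)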

Training Hyperparameters

  • Training regime: FP16 mixed precision (enabled via fp16=True) with PyTorch 2.0’s torch.compile for JIT optimization and gradient_checkpointing=True to reduce activation memory (see the configuration sketch after this list).
  • Chunk size: 128 tokens
  • Batching
    • per_device_train_batch_size=32
    • per_device_eval_batch_size=32
  • Optimizer & Schedule
    • Optimizer: AdamW
    • Learning rate: 2 × 10⁻⁵
    • Learning-rate scheduler: linear
    • Weight decay: 0.01
    • Warmup ratio: 0.06
  • Training schedule:
    • Number of epochs = 5
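
These settings map onto the Hugging Face Trainer roughly as follows (a minimal sketch: output_dir is an assumption, AdamW is the Trainer default optimizer, and the 128-token chunking is done during data preprocessing rather than here):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ota-mdeberta-v3-base",  # assumed output path
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,                 # 2 × 10⁻⁵
    lr_scheduler_type="linear",
    weight_decay=0.01,
    warmup_ratio=0.06,
    fp16=True,                          # FP16 mixed precision
    gradient_checkpointing=True,        # trade compute for activation memory
    torch_compile=True,                 # PyTorch 2.0 JIT optimization
)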

Performance

Epoch   Training Loss   Validation Loss   Perplexity (Val)
1       3.8525          2.0914            8.09
2       2.0859          1.7165            5.56
3       1.8223          1.5849            4.88
4       1.7002          1.4949            4.46
5       1.6427          1.4765            4.38
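
The perplexity column is the exponential of the validation cross-entropy loss, which can be checked directly (small mismatches stem from the losses being rounded to four decimals):

import math

# Perplexity = exp(cross-entropy loss); recompute from the reported losses.
for epoch, loss in enumerate([2.0914, 1.7165, 1.5849, 1.4949, 1.4765], start=1):
    print(f"epoch {epoch}: exp({loss}) = {math.exp(loss):.2f}")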

Model Card Authors

Enes Yılandiloğlu

Model Card Contact

[email protected]
