Gemma3-Singlish-Codemix

Fine-tuned Gemma 3 model for converting code-mixed Singlish/English text into proper Sinhala script.

Model Details

Property             Value
Base Model           savinugunarathna/Gemma3-Singlish-Sinhala-Merged
Fine-tuning Method   QLoRA (4-bit, r=16)
Upload Type          merged
Task                 Code-mixed → Sinhala transliteration

Training Data

  • Phonetic dataset: ~1M Singlish → Sinhala pairs (sampled subset used)
  • Code-mixed dataset: ~22K Singlish/English → Sinhala pairs
  • Curriculum: 3-phase training (phonetic → mixed → code-mix focused)
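
To make the three-phase curriculum concrete, here is a minimal sketch of how each phase could draw from the two datasets. The mixing ratios in `PHASE_MIX`, the phase names, and the `sample_phase` helper are all assumptions for illustration; the actual sampling scheme used in training is not documented here.

```python
import random

# Hypothetical share of phonetic examples in each phase (the remainder is
# code-mixed). These ratios are assumptions, not the values used in training.
PHASE_MIX = {
    "phonetic": 1.0,          # phase 1: phonetic pairs only
    "mixed": 0.5,             # phase 2: even blend of both datasets
    "code_mix_focused": 0.1,  # phase 3: mostly code-mixed pairs
}

def sample_phase(phonetic, code_mixed, phase, n, seed=0):
    """Draw n (source, target) pairs for one curriculum phase."""
    rng = random.Random(seed)
    p_phonetic = PHASE_MIX[phase]
    batch = []
    for _ in range(n):
        # Pick the dataset first, then a random pair from it.
        pool = phonetic if rng.random() < p_phonetic else code_mixed
        batch.append(rng.choice(pool))
    return batch
```

Training would then run the phases in order, for example `for phase in PHASE_MIX: train_on(sample_phase(...))`, where `train_on` stands in for whatever training loop is used.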

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "Pudamya/Gemma3-Singlish-Codemix"

# Load the tokenizer and the merged model in half precision.
# device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")

def translate(text):
    prompt = (
        "### Instruction:\n"
        "Convert the following code-mixed Singlish-English sentence into proper Sinhala script.\n\n"
        f"### Input:\n{text}\n\n"
        "### Response:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=False,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    decoded = tokenizer.decode(out[0], skip_special_tokens=True)
    return decoded.split("### Response:")[-1].strip()

print(translate("mama api ekka movie eke gihin fun hari thibba"))
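
The prompt template and response parsing can also be pulled into small pure-Python helpers, which makes them easy to test without loading the model. This is a sketch: `build_prompt` and `extract_response` are hypothetical names, and cutting the output at the next `###` marker is an added assumption (a guard in case generation continues into a new section), not behaviour of the snippet above.

```python
INSTRUCTION = (
    "Convert the following code-mixed Singlish-English sentence "
    "into proper Sinhala script."
)

def build_prompt(text: str) -> str:
    """Wrap the input text in the instruction template shown above."""
    return (
        "### Instruction:\n"
        f"{INSTRUCTION}\n\n"
        f"### Input:\n{text.strip()}\n\n"
        "### Response:\n"
    )

def extract_response(decoded: str) -> str:
    """Return the text after the first response marker.

    Assumption: anything after a further '###' marker is treated as
    runaway generation and dropped. Returns '' if the marker is missing.
    """
    _, _, tail = decoded.partition("### Response:")
    return tail.split("###")[0].strip()
```

Inside `translate`, these would replace the inline prompt string and the final `split` call.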

Languages

  • Input: Romanized Sinhala / Singlish / Code-mixed Sinhala-English
  • Output: Sinhala script (Unicode)
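
Because the expected output is Sinhala Unicode, a cheap sanity check is to measure how much of a response falls in the Sinhala block (U+0D80 to U+0DFF). The sketch below is an illustrative helper, not part of the model; `sinhala_ratio` is a hypothetical name, and any acceptance threshold you apply to it is your own choice.

```python
SINHALA_START, SINHALA_END = "\u0d80", "\u0dff"

def is_sinhala(ch: str) -> bool:
    """True if the character lies in the Sinhala Unicode block."""
    return SINHALA_START <= ch <= SINHALA_END

def sinhala_ratio(text: str) -> float:
    """Fraction of letter-like characters that are Sinhala.

    Sinhala vowel signs are combining marks rather than letters, so they
    are counted through the block check instead of str.isalpha().
    Spaces, digits, and punctuation are ignored.
    """
    counted = [ch for ch in text if ch.isalpha() or is_sinhala(ch)]
    if not counted:
        return 0.0
    return sum(1 for ch in counted if is_sinhala(ch)) / len(counted)
```

A low ratio on a model response suggests that Latin-script words were left untransliterated, which may or may not be desired for English loanwords.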

Model Size

  • Parameters: 0.3B
  • Tensor type: F16 (Safetensors)