Greek Dialect LoRA — Llama-3 8B Instruct Adapter
LoRA adapter trained by the CLLT Lab (University of Crete) for dialectal Greek generation on top of meta-llama/Meta-Llama-3-8B-Instruct. The adapter follows the same natural-prompt pipeline as the Krikri variant but uses Meta’s instruct-tuned Llama-3 backbone. Training ran for 4,173 steps (3 epochs), with the best checkpoint at step 4,000 (eval loss 1.874).
Project website: https://stergioscha.github.io/CLLT/
Model Details
- Developer: CLLT Lab, University of Crete
- Adapter type: LoRA (PEFT) with r=16, α=32, dropout=0.1 applied to the q/k/v/o/gate/up/down projections (see the configuration sketch after this list)
- Dataset: 23k+ instruction-following pairs covering the Pontic, Cretan, Northern Greek, and Cypriot dialects (derived from GRDD)
- Split: 95% train / 5% validation via a Hugging Face datasets random split
- Precision: bfloat16, gradient accumulation 8 → effective batch size 16
- License: Research purposes only, subject to the Meta Llama 3 license terms
- Compute: AWS GPU resources via GRNET & EU Recovery and Resilience Facility funding
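A minimal sketch of the adapter configuration listed above, assuming the standard PEFT LoraConfig API and the usual Llama-3 projection-module names (the actual training script may differ):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                   # LoRA rank
    lora_alpha=32,          # scaling factor α
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)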
Sources
- GitHub: https://github.com/StergiosCha/krikri_dialectal
- Dataset: https://github.com/StergiosCha/Greek_dialect_corpus
- Website: https://stergioscha.github.io/CLLT/
Intended Use
Direct
- Generate or continue text in specific Greek dialects for cultural documentation or experimentation
- Build dialogue systems that can answer in Pontic, Cretan, Northern Greek, or Cypriot when prompted explicitly
Downstream
- Plug into RAG/chat pipelines that rely on Meta-Llama-3-8B-Instruct as a base (see the sketch after this list)
- Evaluate dialectal control against GRDD+ or bespoke benchmarks
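For chat-style pipelines, a minimal sketch of wrapping a dialect instruction in the Llama-3 chat template, reusing the model and tokenizer objects from the Usage section below; the example message is illustrative:

messages = [
    # "Answer in the Pontic dialect: How is the weather today?" (illustrative prompt)
    {"role": "user", "content": "Απάντησε στην ποντιακή διάλεκτο: Πώς είναι ο καιρός σήμερα;"},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
reply = model.generate(chat_inputs, max_new_tokens=160, do_sample=True, temperature=0.8)
# Decode only the newly generated tokens
print(tokenizer.decode(reply[0][chat_inputs.shape[-1]:], skip_special_tokens=True))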
Out-of-scope
- Critical or safety-sensitive deployments without native-speaker review
- Automatic translation or identification of dialects (model produces text; it is not a classifier)
- Standard Modern Greek generation (Standard Modern Greek entries were removed from the training data)
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base instruct model in bfloat16 across available devices
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Attach the dialect LoRA adapter
model = PeftModel.from_pretrained(base, "Stergios/llama3-8b-instruct-lora")

# "Answer in Cretan: Where shall we meet?"
prompt = "Απάντησε στα κρητικά: Πού θα συναντηθούμε;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=160, temperature=0.8, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
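If a downstream pipeline should run without the peft dependency at inference time, the adapter can be folded into the base weights using standard PEFT functionality (a sketch; the output directory name is illustrative):

# Merge the LoRA weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("llama3-8b-instruct-dialect-merged")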
Training Data & Procedure
- Preparation: convert_to_natural_prompts_dialects_only.py converts <po>/<cr>/<no>/<cy> tags into friendly Greek instructions (e.g., “Γράψε στην κρητική διάλεκτο: …”, i.e. “Write in the Cretan dialect: …”).
- Filtering: Removed Standard Modern Greek entries to keep the adapter dialect-focused.
- Tokenization: 512 tokens, padding to max length, labels = input IDs.
- Hyperparameters: epochs=3, lr=3e-4, warmup=100, save/eval every 200 steps, load_best_model_at_end=True.
- Checkpoint size: adapter ≈ 170 MB (adapter_model.safetensors).
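A sketch of the split, tokenization, and trainer settings described above, assuming standard datasets/transformers APIs; here dataset and tokenizer are assumed to be the prompt corpus and the Llama-3 tokenizer, and the per-device batch size is an assumption (the card only states gradient accumulation 8 → effective batch size 16):

from transformers import TrainingArguments

splits = dataset.train_test_split(test_size=0.05)  # 95% train / 5% validation (see Model Details)

def tokenize(example):
    enc = tokenizer(
        example["text"],
        max_length=512,        # 512-token context
        padding="max_length",  # pad to max length
        truncation=True,
    )
    enc["labels"] = enc["input_ids"].copy()  # labels = input IDs (causal LM objective)
    return enc

args = TrainingArguments(
    output_dir="llama3-8b-instruct-dialect-lora",
    num_train_epochs=3,
    learning_rate=3e-4,
    warmup_steps=100,
    per_device_train_batch_size=2,   # assumed: 2 × gradient accumulation 8 = effective 16
    gradient_accumulation_steps=8,
    bf16=True,
    eval_strategy="steps",           # "evaluation_strategy" in older transformers releases
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,
)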
Evaluation
- Automatic: Validation loss tracked every 200 steps; best checkpoint at step 4,000 (eval loss 1.874).
- Recommended manual checks: Have native speakers verify correctness, register, and cultural sensitivity.
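For context, if the reported eval loss is the standard per-token cross-entropy, the step-4,000 value of 1.874 corresponds to a validation perplexity of roughly exp(1.874) ≈ 6.5.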
Limitations & Risks
- Dialect mixing can occur if prompts are vague. Specify the dialect explicitly.
- Model inherits any biases present in GRDD (topics, speaker demographics, orthography).
- The Llama 3 family license disallows certain use cases; comply with Meta’s terms in addition to the “research only” clause here.
Acknowledgments
- National Infrastructures for Research and Technology (GRNET) for AWS credits
- EU Recovery & Resilience Facility for funding
- Meta for the base Llama 3 models
Contact
Questions or issues? Open an issue on the GitHub repository or reach out to the CLLT Lab (University of Crete).