Greek Dialect LoRA — Llama-3 8B Instruct Adapter
LoRA adapter trained by the CLLT Lab (University of Crete) for dialectal Greek generation on top of meta-llama/Meta-Llama-3-8B-Instruct. The adapter follows the same natural-prompt pipeline as the Krikri variant but uses Meta’s instruct-tuned Llama-3 backbone. Training ran for 4,173 steps (3 epochs), with the best checkpoint at step 4,000 (eval loss 1.874).
Project website: https://stergioscha.github.io/CLLT/
Model Details
- Developer: CLLT Lab, University of Crete
- Adapter type: LoRA (PEFT) with r=16, α=32, dropout=0.1 applied to the q/k/v/o/gate/up/down projections (see the configuration sketch after this list)
- Dataset: 23k+ instruction-following pairs covering the Pontic, Cretan, Northern Greek, and Cypriot dialects (derived from GRDD)
- Split: 95% train / 5% validation via a Hugging Face datasets random split
- Precision: bfloat16, gradient accumulation 8 → effective batch size 16
- License: Research purposes only, subject to the Meta Llama 3 license terms
- Compute: AWS GPU resources via GRNET & EU Recovery and Resilience Facility funding
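A minimal sketch of the adapter configuration listed above, assuming the standard PEFT LoraConfig API and the usual Llama-3 projection-module names (the actual training script may differ):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                   # LoRA rank
    lora_alpha=32,          # scaling factor α
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)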
Sources
- GitHub: https://github.com/StergiosCha/krikri_dialectal
- Dataset: https://github.com/StergiosCha/Greek_dialect_corpus
- Website: https://stergioscha.github.io/CLLT/
Intended Use
Direct
- Generate or continue text in specific Greek dialects for cultural documentation or experimentation
- Build dialogue systems that can answer in Pontic, Cretan, Northern Greek, or Cypriot when prompted explicitly
Downstream
- Plug into RAG/chat pipelines that rely on Meta-Llama-3-8B-Instruct as a base (see the sketch after this list)
- Evaluate dialectal control against GRDD+ or bespoke benchmarks
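For chat-style pipelines, a minimal sketch of wrapping a dialect instruction in the Llama-3 chat template, reusing the model and tokenizer objects from the Usage section below; the example message is illustrative:

messages = [
    # "Answer in the Pontic dialect: How is the weather today?" (illustrative prompt)
    {"role": "user", "content": "Απάντησε στην ποντιακή διάλεκτο: Πώς είναι ο καιρός σήμερα;"},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
reply = model.generate(chat_inputs, max_new_tokens=160, do_sample=True, temperature=0.8)
# Decode only the newly generated tokens
print(tokenizer.decode(reply[0][chat_inputs.shape[-1]:], skip_special_tokens=True))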
Out-of-scope
- Critical or safety-sensitive deployments without native-speaker review
- Automatic translation or identification of dialects (model produces text; it is not a classifier)
- Standard Modern Greek generation (Standard Modern Greek entries were removed from the training data)
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base instruct model in bfloat16 across available devices
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Attach the dialect LoRA adapter
model = PeftModel.from_pretrained(base, "Stergios/llama3-8b-instruct-lora")

# "Answer in Cretan: Where shall we meet?"
prompt = "Απάντησε στα κρητικά: Πού θα συναντηθούμε;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=160, temperature=0.8, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
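If a downstream pipeline should run without the peft dependency at inference time, the adapter can be folded into the base weights using standard PEFT functionality (a sketch; the output directory name is illustrative):

# Merge the LoRA weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("llama3-8b-instruct-dialect-merged")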
Training Data & Procedure
- Preparation: convert_to_natural_prompts_dialects_only.py converts <po>/<cr>/<no>/<cy> tags into friendly Greek instructions (e.g., “Γράψε στην κρητική διάλεκτο: …”, i.e. “Write in the Cretan dialect: …”).
- Filtering: Removed Standard Modern Greek entries to keep the adapter dialect-focused.
- Tokenization: 512 tokens, padding to max length, labels = input IDs.
- Hyperparameters: epochs=3, lr=3e-4, warmup=100, save/eval every 200 steps, load_best_model_at_end=True.
- Checkpoint size: adapter ≈ 170 MB (adapter_model.safetensors).
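A sketch of the split, tokenization, and trainer settings described above, assuming standard datasets/transformers APIs; here dataset and tokenizer are assumed to be the prompt corpus and the Llama-3 tokenizer, and the per-device batch size is an assumption (the card only states gradient accumulation 8 → effective batch size 16):

from transformers import TrainingArguments

splits = dataset.train_test_split(test_size=0.05)  # 95% train / 5% validation (see Model Details)

def tokenize(example):
    enc = tokenizer(
        example["text"],
        max_length=512,        # 512-token context
        padding="max_length",  # pad to max length
        truncation=True,
    )
    enc["labels"] = enc["input_ids"].copy()  # labels = input IDs (causal LM objective)
    return enc

args = TrainingArguments(
    output_dir="llama3-8b-instruct-dialect-lora",
    num_train_epochs=3,
    learning_rate=3e-4,
    warmup_steps=100,
    per_device_train_batch_size=2,   # assumed: 2 × gradient accumulation 8 = effective 16
    gradient_accumulation_steps=8,
    bf16=True,
    eval_strategy="steps",           # "evaluation_strategy" in older transformers releases
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,
)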
Evaluation
- Automatic: Validation loss tracked every 200 steps; best checkpoint at step 4,000 (eval loss 1.874).
- Recommended manual checks: Have native speakers verify correctness, register, and cultural sensitivity.
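For context, if the reported eval loss is the standard per-token cross-entropy, the step-4,000 value of 1.874 corresponds to a validation perplexity of roughly exp(1.874) ≈ 6.5.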
Limitations & Risks
- Dialect mixing can occur if prompts are vague. Specify the dialect explicitly.
- Model inherits any biases present in GRDD (topics, speaker demographics, orthography).
- The Llama 3 family license disallows certain use cases; comply with Meta’s terms in addition to the “research only” clause here.
Acknowledgments
- National Infrastructures for Research and Technology (GRNET) for AWS credits
- EU Recovery & Resilience Facility for funding
- Meta for the base Llama 3 models
Contact
Questions or issues? Open an issue on the GitHub repository or reach out to the CLLT Lab (University of Crete).