Greek Dialect LoRA — Llama-3.1 8B Instruct Adapter
LoRA adapter that transfers the dialect-focused training recipe to Meta's Llama-3.1 8B Instruct base. Built by the CLLT Lab (University of Crete) to provide an open-weight Llama-3.1 backbone alongside the Krikri and Llama-3 releases, keeping the data processing and hyperparameters identical across backbones.
Project website: https://stergioscha.github.io/CLLT/
Model Details
- Developer: CLLT Lab, University of Crete
- Base model: meta-llama/Llama-3.1-8B-Instruct
- Adapter: LoRA (r=16, α=32, dropout=0.1 on the q/k/v/o/gate/up/down projections) via PEFT; see the configuration sketch after this list
- Languages: Greek dialects (Pontic, Cretan, Northern Greek, Cypriot)
- Training data: 23k+ natural prompts derived from GRDD / GRDD+ with Standard Greek removed
- Purpose: Research-only release aimed at cultural preservation and dialect-aware NLP experimentation
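For reference, a minimal PEFT configuration matching the settings listed above might look like the following. The target-module names follow Llama-style naming and are an assumption, not copied from the original training script.

```python
from peft import LoraConfig

# Sketch of a LoRA config matching the card's stated settings (r=16, alpha=32,
# dropout=0.1). Target-module names are assumed Llama-style projection names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```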
Sources
- GitHub: https://github.com/StergiosCha/krikri_dialectal
- Dataset: https://github.com/StergiosCha/Greek_dialect_corpus
- Website: https://stergioscha.github.io/CLLT/
Intended Use
Suitable for
- Prompting Llama-3.1 8B to reply in a chosen dialect via natural Greek instructions
- Prototyping tools that surface dialectal variants in educational or cultural applications
- Comparing adapter behaviour across the three released backbones (Krikri, Llama-3, Llama-3.1)
Not suitable for
- Production-facing deployments without careful evaluation
- Safety-critical, legal, or medical settings
- Translation or dialect identification tasks (the adapter only generates text)
- Standard Modern Greek output (training data intentionally omits it)
Quickstart
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model in bfloat16 and spread it across available GPUs
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Attach the dialect LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base, "Stergios/llama3.1-8b-instruct-lora")

# "Write in the Cypriot dialect: How is your day going?"
prompt = "Γράψε στην κυπριακή διάλεκτο: Πώς πάει η μέρα σου;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,   # enable sampling so the temperature setting takes effect
    temperature=0.75,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
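If you prefer to serve the model without loading PEFT at inference time, one option (not part of the original release) is to merge the adapter into the base weights. A minimal sketch, continuing from the Quickstart above; the output path is illustrative:

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("llama3.1-8b-instruct-greek-dialects-merged")
tokenizer.save_pretrained("llama3.1-8b-instruct-greek-dialects-merged")
```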
Training Summary
- Preprocessing: same natural-prompt conversion pipeline as the other adapters; 95/5 train/validation split; inputs truncated to 512 tokens
- Hyperparameters: epochs=3, per-device batch size=2, gradient accumulation=8 (effective batch size of 16 per device), lr=3e-4, warmup steps=100; LoRA config identical across models (see the training-arguments sketch after this list)
- Precision: bfloat16, with `device_map="auto"` for multi-GPU utilisation
- Artifacts: `adapter_model.safetensors` (~170 MB); tokenizer and chat template included for convenience
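As a rough guide, the hyperparameters above map onto Hugging Face `TrainingArguments` along these lines. The scheduler, save/eval cadence, and output path are illustrative assumptions, not facts from the training script.

```python
from transformers import TrainingArguments

# Sketch reproducing the card's stated hyperparameters; cadence and paths are assumed.
training_args = TrainingArguments(
    output_dir="llama3.1-8b-instruct-greek-dialects-lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16 per device
    learning_rate=3e-4,
    warmup_steps=100,
    bf16=True,
    eval_strategy="epoch",           # track validation loss on the 5% split
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the checkpoint with the lowest val loss
)
```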
Evaluation & Monitoring
- Validation loss is tracked during training and the best checkpoint is saved automatically
- Manual inspection by native speakers recommended for dialect fidelity, register, and orthography
Risks & Limitations
- Dialect knowledge limited to what is present in GRDD; lesser-documented varieties may be under-represented
- Users should specify the target dialect explicitly; generic prompts may fall back to Standard Greek (see the example prompts after this list)
- The adapter inherits biases from both the base model and GRDD (topics, register, gender representation, etc.)
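For illustration, explicit dialect instructions of the following shape tend to be less ambiguous than generic requests. The exact wording is a suggestion and is not taken from the training data.

```python
# Illustrative dialect-specific prompts (phrasing is a suggestion, not from GRDD).
dialect_prompts = {
    "Pontic":   "Γράψε στην ποντιακή διάλεκτο: Πώς πάει η μέρα σου;",
    "Cretan":   "Γράψε στην κρητική διάλεκτο: Πώς πάει η μέρα σου;",
    "Northern": "Γράψε στα βόρεια ιδιώματα: Πώς πάει η μέρα σου;",
    "Cypriot":  "Γράψε στην κυπριακή διάλεκτο: Πώς πάει η μέρα σου;",
}
# Each value means: "Write in the <dialect>: How is your day going?"
```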
Acknowledgments
- AWS compute courtesy of GRNET
- Funding by the EU Recovery and Resilience Facility
- Meta for providing the Llama-3.1 base weights
Contact
Please open an issue in the GitHub repository or contact the CLLT Lab (University of Crete) for questions and collaboration.