Greek Dialect LoRA — Llama-3 8B Instruct Adapter

LoRA adapter trained by the CLLT Lab (University of Crete) for dialectal Greek generation on top of meta-llama/Meta-Llama-3-8B-Instruct. The adapter follows the same natural-prompt pipeline as the Krikri variant but builds on Meta’s instruct-tuned Llama 3 backbone. Training ran for 4,173 steps (3 epochs), with the best checkpoint at step 4,000 (eval loss 1.874).

Project website: https://stergioscha.github.io/CLLT/

Model Details

  • Developer: CLLT Lab, University of Crete
  • Adapter type: LoRA (PEFT) with r=16, α=32, dropout=0.1, applied to the q/k/v/o/gate/up/down projections (see the configuration sketch after this list)
  • Dataset: 23k+ instruction-following pairs covering the Pontic, Cretan, Northern Greek, and Cypriot dialects (derived from GRDD)
  • Split: 95% train / 5% validation using Hugging Face datasets random split
  • Precision: bfloat16, gradient accumulation 8 → effective batch size 16
  • License: Research purposes only, subject to the Meta Llama 3 license terms
  • Compute: AWS GPU resources via GRNET & EU Recovery and Resilience Facility funding
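
The adapter and data-split configuration above can be reproduced roughly as follows. This is a minimal sketch based on the listed hyperparameters; the exact training script may differ, and the data file name and seed are placeholders.

from peft import LoraConfig
from datasets import load_dataset

# LoRA hyperparameters as listed above; the target modules are the
# Llama-3 attention and MLP projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# 95% train / 5% validation random split with Hugging Face datasets.
# "dialect_pairs.json" and the seed are illustrative placeholders.
dataset = load_dataset("json", data_files="dialect_pairs.json")["train"]
split = dataset.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = split["train"], split["test"]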

Intended Use

Direct

  • Generate or continue prompts in specific Greek dialects for cultural documentation or experimentation
  • Build dialogue systems that can answer in Pontic, Cretan, Northern Greek, or Cypriot when prompted explicitly

Downstream

  • Plug into RAG/chat pipelines that rely on Meta-Llama-3-8B-Instruct as a base (see the merge sketch after this list)
  • Evaluate dialectal control against GRDD+ or bespoke benchmarks
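
For pipelines that already serve the base model, the adapter can either be attached at load time (as in the Usage section) or folded into the base weights so the serving stack needs no PEFT dependency. A minimal merge sketch; the output directory name is illustrative:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "Stergios/llama3-8b-instruct-lora")

# Fold the LoRA weights into the base model and save a plain Llama-3 checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("llama3-8b-greek-dialects-merged")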

Out-of-scope

  • Critical or safety-sensitive deployments without native-speaker review
  • Automatic translation or identification of dialects (model produces text; it is not a classifier)
  • Standard Modern Greek generation (Standard Modern Greek entries were filtered out of the training data)

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = PeftModel.from_pretrained(base, "Stergios/llama3-8b-instruct-lora")

prompt = "Απάντησε στα κρητικά: Πού θα συναντηθούμε;"  # "Answer in Cretan: Where shall we meet?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=160, temperature=0.8, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
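
Because the backbone is an instruct model, wrapping the dialect request in the Llama-3 chat template usually yields better-behaved replies. A sketch reusing the tokenizer and model from the snippet above:

# Same request, routed through the chat template instead of a raw prompt.
messages = [{"role": "user", "content": "Απάντησε στα κρητικά: Πού θα συναντηθούμε;"}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
chat_output = model.generate(chat_inputs, max_new_tokens=160, temperature=0.8, do_sample=True)
# Decode only the newly generated tokens.
print(tokenizer.decode(chat_output[0][chat_inputs.shape[-1]:], skip_special_tokens=True))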

Training Data & Procedure

  • Preparation: convert_to_natural_prompts_dialects_only.py converts <po>/<cr>/<no>/<cy> tags into natural Greek instructions (e.g., “Γράψε στην κρητική διάλεκτο: …”, “Write in the Cretan dialect: …”).
  • Filtering: Removed Standard Modern Greek entries to keep the adapter dialect-focused.
  • Tokenization: 512 tokens, padding to max length, labels = input IDs.
  • Hyperparameters: epochs=3, lr=3e-4, warmup=100 steps, save/eval every 200 steps, load_best_model_at_end=True (a training sketch follows this list).
  • Checkpoint size: adapter ≈ 170 MB (adapter_model.safetensors).
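
The procedure can be approximated with the standard Trainer API. The sketch below reuses base and tokenizer from the Usage snippet and lora_config, train_ds, and eval_ds from the configuration sketch in Model Details; the tag-to-instruction wording and the per-device batch size are assumptions, not the exact contents of the training scripts.

from peft import get_peft_model
from transformers import TrainingArguments, Trainer

# Dialect tags → natural Greek instructions, mirroring
# convert_to_natural_prompts_dialects_only.py (exact wording may differ).
TAG_TO_INSTRUCTION = {
    "<po>": "Γράψε στην ποντιακή διάλεκτο: ",  # "Write in the Pontic dialect: "
    "<cr>": "Γράψε στην κρητική διάλεκτο: ",   # "Write in the Cretan dialect: "
    "<no>": "Γράψε στη βόρεια διάλεκτο: ",     # "Write in the Northern dialect: "
    "<cy>": "Γράψε στην κυπριακή διάλεκτο: ",  # "Write in the Cypriot dialect: "
}

def to_natural_prompt(example):
    text = example["text"]
    for tag, instruction in TAG_TO_INSTRUCTION.items():
        text = text.replace(tag, instruction)
    return {"text": text}

# Llama-3's tokenizer ships without a pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    # 512-token context, padded to max length, labels copied from the input IDs.
    enc = tokenizer(example["text"], truncation=True, max_length=512, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()
    return enc

model = get_peft_model(base, lora_config)

args = TrainingArguments(
    output_dir="llama3-8b-greek-dialect-lora",
    num_train_epochs=3,
    learning_rate=3e-4,
    warmup_steps=100,
    per_device_train_batch_size=2,   # with accumulation 8 → effective batch 16
    gradient_accumulation_steps=8,
    bf16=True,
    eval_strategy="steps",
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(to_natural_prompt).map(tokenize),
    eval_dataset=eval_ds.map(to_natural_prompt).map(tokenize),
)
trainer.train()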

Evaluation

  • Automatic: Validation loss tracked every 200 steps; best checkpoint at step 4,000 (eval loss 1.874).
  • Recommended manual checks: Have native speakers verify correctness, register, and cultural sensitivity.
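
For reference, the eval loss of 1.874 corresponds to a validation perplexity of roughly exp(1.874) ≈ 6.5.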

Limitations & Risks

  • Dialect mixing can occur if prompts are vague. Specify the dialect explicitly.
  • Model inherits any biases present in GRDD (topics, speaker demographics, orthography).
  • The Meta Llama 3 license disallows certain use cases; comply with Meta’s terms in addition to the “research only” restriction stated here.

Acknowledgments

  • National Infrastructures for Research and Technology (GRNET) for AWS credits
  • EU Recovery & Resilience Facility for funding
  • Meta for the base Llama 3 models

Contact

Questions or issues? Open an issue on the GitHub repository or reach out to the CLLT Lab (University of Crete).
