X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
X-EcoMLA is a KV cache compression technique for large language models (LLMs), proposed by AMD, that upcycles the attention blocks of a pre-trained transformer into Multi-head Latent Attention (MLA) for extreme KV cache compression and improved computational efficiency.
Instead of training an MLA model from scratch, X-EcoMLA first initializes the MLA weights via Singular Value Decomposition (SVD) of the existing transformer weights and then applies lightweight pre-training or post-training distillation.
This model, X-EcoMLA-3B3B-fixed-kv816-DPO, was created by adapting the pre-trained Llama-3.2-3B-Instruct model through post-training on AMD Instinct™ MI300X GPUs, bypassing the need for costly pre-training from scratch.
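For intuition, the sketch below shows the core idea behind the SVD-based initialization: a pretrained projection matrix is factored into a shared low-rank down-projection and an up-projection via truncated SVD. The shapes and the `svd_init_low_rank` helper are illustrative assumptions; the exact X-EcoMLA mapping (which projections are jointly factorized, the RoPE/NoPE split, and rank selection) follows the paper.

```python
# Illustrative sketch only: truncated-SVD factorization of a pretrained projection
# into a rank-r down/up pair. The exact X-EcoMLA initialization follows the paper.
import torch

def svd_init_low_rank(W: torch.Tensor, r: int):
    """Factor W (out_dim x in_dim) into W_up (out_dim x r) @ W_down (r x in_dim)."""
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
    W_up = U[:, :r] * S[:r]   # absorb singular values into the up-projection
    W_down = Vh[:r, :]        # shared rank-r down-projection
    return W_up.to(W.dtype), W_down.to(W.dtype)

# Example shapes loosely modeled on Llama-3.2-3B's value projection
# (8 KV heads x head_dim 128 = 1024 rows, hidden size 3072 columns), rank r_kv = 816.
W_v = torch.randn(1024, 3072, dtype=torch.bfloat16)  # stand-in for a real pretrained weight
W_up, W_down = svd_init_low_rank(W_v, r=816)
print(W_up.shape, W_down.shape)  # torch.Size([1024, 816]) torch.Size([816, 3072])
```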
Key Takeaways
- Announcing X-EcoMLA, an efficient approach to upcycle existing transformer blocks into MLA.
- Extreme KV Cache Compression: X-EcoMLA reduces the KV cache size by 6.4x to 10.6x using only 3.6B to 7B training tokens, while preserving nearly 100% of the base model's average zero-shot performance on LM Harness tasks.
- Novel SVD Initialization: X-EcoMLA employs an efficient SVD-based weight initialization that significantly improves training efficiency and model performance.
Model Composition Pipeline
The X-EcoMLA models are not trained from scratch. Instead, they are composed from powerful pre-trained Transformers through a lightweight and efficient pipeline. The creation of this model followed these stages:
| Stage | Action | Description |
|---|---|---|
| 1. Base Model | Llama-3.2-3B-Instruct | The starting point is a high-quality, pre-trained Transformer model. |
| 2. Initialization | Structured Weight Mapping | MLA models are initialized from the base model's weights using SVD. |
| 3. SFT | End-to-End Knowledge Distillation | The initialized model is fine-tuned via knowledge distillation. |
| 4. Alignment | Direct Preference Optimization (DPO) | In the final stage, DPO is used to align the model's preferences, with the distilled student model itself serving as the reference model for stability. |
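The distillation stage can be pictured with a standard token-level KD loss; the sketch below is a minimal example assuming a plain KL divergence between teacher and student next-token distributions, with `distillation_loss` being a hypothetical helper. The actual loss weighting, temperature, and training loop follow the AMD-Hybrid-Models repo and the paper.

```python
# Minimal sketch of a token-level distillation loss (Stage 3). Hypothetical helper;
# the exact recipe (temperature, mixing with cross-entropy) follows the repo/paper.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, attention_mask, temperature=1.0):
    """KL(teacher || student), averaged over non-padding tokens."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)  # [batch, seq]
    mask = attention_mask.float()
    return (kl * mask).sum() / mask.sum() * (t ** 2)
```

Stage 4 then runs DPO on the preference datasets listed in the next section, with the distilled student itself frozen as the reference model.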
Training Data
| Stage | Dataset | License |
|---|---|---|
| SFT | https://huggingface.co/datasets/teknium/OpenHermes-2.5 | Refer to source materials |
| SFT | https://huggingface.co/datasets/tomg-group-umd/GenQA | CC BY-NC 4.0 |
| SFT | https://huggingface.co/datasets/BAAI/Infinity-Instruct | CC BY-SA 4.0 |
| DPO | https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized | MIT |
| DPO | https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs | MIT |
| DPO | https://huggingface.co/datasets/JunxiongWang/llama3-ultrafeedback-armorm | MIT |
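For reference, these datasets can be pulled directly with the Hugging Face `datasets` library, as in the minimal sketch below; the split shown is illustrative, and the exact data mixture and preprocessing follow the AMD-Hybrid-Models training scripts.

```python
# Illustrative only: inspect one of the SFT datasets; the exact mixture and
# preprocessing follow the AMD-Hybrid-Models training scripts.
from datasets import load_dataset

openhermes = load_dataset("teknium/OpenHermes-2.5", split="train")
print(len(openhermes), openhermes.column_names)
```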
Getting Started
Installation
```bash
git clone https://github.com/AMD-AIG-AIMA/AMD-Hybrid-Models.git
cd AMD-Hybrid-Models/X-EcoMLA
```
Then follow the installation instructions in the AMD-AIG-AIMA/AMD-Hybrid-Models repo.
Example Usage
Once the installation is complete, you can run the following code for a quick test:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from mla.hybrid_wrapper import MLATransformerHybridModelWrapper

checkpoint = "amd/X-EcoMLA-3B3B-fixed-kv816-DPO"
model = MLATransformerHybridModelWrapper.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model.eval()

# Format the prompt using the chat template
prompt = [{"role": "user", "content": "What are the benefits of hybrid language models?"}]
input_ids = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
).cuda()

# Generate a response
tokens = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(tokens[0], skip_special_tokens=False))
```
Model Evaluation
To reproduce the zero-shot results reported below, run the LM Evaluation Harness script from the AMD-Hybrid-Models repo:
```bash
python benchmark/llm_eval/lm_harness_eval.py \
    --model mla_hybrid \
    --model_args pretrained="amd/X-EcoMLA-3B3B-fixed-kv816-DPO" \
    --tasks mmlu,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa,pubmedqa,race \
    --num_fewshot 0 --device cuda --batch_size 16
```
Model Details
| Model | KV Cache Size (vs. base) | Target Model | Teacher Model | Training Tokens | Pre-/Post-Training | r_kv | r_q | d_rope | d_nope |
|---|---|---|---|---|---|---|---|---|---|
| X-EcoMLA-1B1B-fixed-kv512-DPO | 53.1% | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | 7B | Post | 512 | 864 | 32 | 32 |
| X-EcoMLA-1B1B-dynamic-0.95-DPO | 54.7% | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | 7B | Post | 0.95 | 0.95 | 32 | 32 |
| X-EcoMLA-1B8B-fixed-kv64-DPO | 9.4% | Llama-3.2-1B-Instruct | Llama-3.1-8B-Instruct | 7B | Post | 64 | 1424 | 32 | 32 |
| X-EcoMLA-3B3B-fixed-kv816-DPO | 43% | Llama-3.2-3B-Instruct | Llama-3.2-3B-Instruct | 7B | Post | 816 | 1536 | 64 | 64 |
| X-EcoMLA-3B3B-dynamic-0.95-DPO | 43% | Llama-3.2-3B-Instruct | Llama-3.2-3B-Instruct | 7B | Post | 0.95 | 0.95 | 64 | 64 |
| X-EcoMLA-SmolLM-1.7B-fixed-kv480-Pretrain | 12.5% | SmolLM-1.7B | - | 6B | Pre | 480 | 2048 | 32 | 32 |
| X-EcoMLA-SmolLM-1.7B1.7B-fixed-kv480-Pretrain | 12.5% | SmolLM-1.7B | SmolLM-1.7B | 6B | Pre | 480 | 2048 | 32 | 32 |
| X-EcoMLA-SmolLM-1.7B1.7B-fixed-kv480-DPO | 12.5% | SmolLM-1.7B-Instruct | SmolLM-1.7B-Instruct | 7B | Post | 480 | 2048 | 32 | 32 |
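The KV Cache Size column can be reproduced from r_kv and d_rope, assuming the MLA cache stores one compressed latent of dimension r_kv plus one shared RoPE key of dimension d_rope per token per layer, versus 2 × n_kv_heads × head_dim for the baseline attention; the head counts below are the published configurations of the respective base models.

```python
# Reproducing the "KV Cache Size" column, assuming MLA caches (r_kv + d_rope) values
# per token per layer versus 2 * n_kv_heads * head_dim for the baseline attention.
def mla_kv_fraction(r_kv, d_rope, n_kv_heads, head_dim):
    return (r_kv + d_rope) / (2 * n_kv_heads * head_dim)

print(f"{mla_kv_fraction(816, 64, 8, 128):.1%}")  # Llama-3.2-3B (8 KV heads, head_dim 128): 43.0%
print(f"{mla_kv_fraction(512, 32, 8, 64):.1%}")   # Llama-3.2-1B, kv512: 53.1%
print(f"{mla_kv_fraction(64, 32, 8, 64):.1%}")    # Llama-3.2-1B, kv64: 9.4%
print(f"{mla_kv_fraction(480, 32, 32, 64):.1%}")  # SmolLM-1.7B (assuming MHA, 32 KV heads): 12.5%
```

The compression factors quoted in the Key Takeaways are the reciprocals of these fractions (for example, 1 / 0.094 ≈ 10.6x for the kv64 configuration).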
Benchmark Results
X-EcoMLA was evaluated on zero-shot tasks from the Language Model Evaluation Harness and compared against its base model and other post-training methods. The results demonstrate that X-EcoMLA provides a superior balance of performance and efficiency.
| Tasks | Metric | Llama-3.2-3B-Instruct | X-EcoMLA-3B3B-fixed-kv816-DPO | X-EcoMLA-3B3B-dynamic-0.95-DPO |
|---|---|---|---|---|
| arc_challenge | acc | 0.4369±0.0145 | 0.4753±0.0146 | 0.4710±0.0146 |
| | acc_norm | 0.4590±0.0146 | 0.4821±0.0146 | 0.4846±0.0146 |
| arc_easy | acc | 0.7428±0.0090 | 0.7660±0.0087 | 0.7580±0.0088 |
| | acc_norm | 0.6776±0.0096 | 0.7045±0.0094 | 0.6999±0.0094 |
| hellaswag | acc | 0.5222±0.0050 | 0.5288±0.0050 | 0.5320±0.0050 |
| | acc_norm | 0.7036±0.0046 | 0.7224±0.0045 | 0.7226±0.0045 |
| mmlu | acc | 0.6046±0.1057 | 0.5742±0.1014 | 0.5773±0.1028 |
| - humanities | acc | 0.5926±0.0826 | 0.5507±0.0843 | 0.5518±0.0851 |
| - other | acc | 0.6598±0.1118 | 0.6312±0.1011 | 0.6344±0.1070 |
| - social_sciences | acc | 0.6701±0.0712 | 0.6383±0.0741 | 0.6422±0.0765 |
| - stem | acc | 0.5043±0.1122 | 0.4906±0.1089 | 0.4960±0.1071 |
| openbookqa | acc | 0.2740±0.0200 | 0.2920±0.0204 | 0.3000±0.0205 |
| | acc_norm | 0.3620±0.0215 | 0.3840±0.0218 | 0.3940±0.0219 |
| piqa | acc | 0.7606±0.0100 | 0.7573±0.0100 | 0.7579±0.0100 |
| | acc_norm | 0.7557±0.0100 | 0.7655±0.0099 | 0.7579±0.0100 |
| pubmedqa | acc | 0.6960±0.0206 | 0.6680±0.0211 | 0.6840±0.0208 |
| race | acc | 0.4077±0.0152 | 0.4622±0.0154 | 0.4632±0.0154 |
| winogrande | acc | 0.6717±0.0132 | 0.6859±0.0130 | 0.6590±0.0133 |
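As a quick check on the claim that zero-shot performance is preserved, the macro-average of the top-level acc rows above can be recomputed directly from the table (mmlu sub-category rows excluded):

```python
# Macro-average of the top-level `acc` rows above; values copied verbatim from the table.
base  = [0.4369, 0.7428, 0.5222, 0.6046, 0.2740, 0.7606, 0.6960, 0.4077, 0.6717]
kv816 = [0.4753, 0.7660, 0.5288, 0.5742, 0.2920, 0.7573, 0.6680, 0.4622, 0.6859]
print(f"base: {sum(base)/len(base):.4f}, X-EcoMLA kv816: {sum(kv816)/len(kv816):.4f}")
# base: 0.5685, X-EcoMLA kv816: 0.5789
```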
Conclusion
X-EcoMLA demonstrates an efficient technique for upcycling pre-trained Transformer attention into MLA modules to compress the KV cache. This work highlights the viability of post-training hybridization as a cost-effective and environmentally sustainable alternative to full retraining, paving the way for deploying powerful LLMs in resource-constrained environments.
Bias, Risks, and Limitations
- This model is a research artifact and has not been evaluated for safety in production use cases.
- The model's performance is dependent on the quality of its pre-trained base model and the teacher model used during distillation. Its capabilities and biases are inherited from these sources.
- The model may generate content that is factually inaccurate, biased, or otherwise objectionable. Users should be aware of these risks and implement appropriate safeguards for their applications.
- One limitation of this work is the reliance on a strong teacher model for knowledge transfer, which may not always be available. Distillation from a teacher also adds to the resource requirements during the post-training phase.
Citation
If you find this model useful, please consider citing the original paper:
```bibtex
@article{li2025x,
  title={X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression},
  author={Li, Guihong and Rezagholizadeh, Mehdi and Yang, Mingyu and Appia, Vikram and Barsoum, Emad},
  journal={arXiv preprint arXiv:2503.11132},
  year={2025}
}
```