kimi-k2.5-eagle3-mla
Model Overview
kimi-k2.5-eagle3-mla is an Eagle3 MTP draft model with MLA (Multi-head Latent Attention) for accelerating inference of Kimi-K2.5. It was trained with TorchSpec, an online speculative decoding training framework that runs FSDP training and inference concurrently. If you find this draft model useful, please give our project TorchSpec a star on GitHub.
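If you want the draft weights locally (for inspection or offline deployment), a standard huggingface_hub download is enough. This is a minimal sketch; it assumes nothing beyond the public repo id shown on this card.

```python
# Minimal sketch: download the draft-model weights from the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="lightseekorg/kimi-k2.5-eagle3-mla")
print(f"Draft model downloaded to: {local_dir}")
```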
Why an MLA (Multi-head Latent Attention) Draft Model?
Compared with an MHA draft model, the MLA variant is a better fit for Kimi-K2.5 deployment:
- Uses less KV cache, which reduces serving memory pressure (see the rough size comparison after this list).
- Matches Kimi-K2.5's MLA architecture, so it integrates more naturally with the inference engine's KV-cache handling across serving scenarios such as PD (prefill-decode) disaggregation.
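As a rough illustration of the KV-cache savings, the sketch below compares per-token cache size for a standard MHA layer against an MLA layer that caches a single compressed latent plus a small decoupled RoPE key. All dimensions are hypothetical placeholders, not Kimi-K2.5's actual configuration.

```python
# Rough per-token KV-cache size, MHA vs. MLA (hypothetical dimensions,
# NOT Kimi-K2.5's actual config).
BYTES_PER_ELEM = 2  # fp16/bf16

# Hypothetical MHA layer: cache full K and V for every head.
num_heads, head_dim = 64, 128
mha_bytes = 2 * num_heads * head_dim * BYTES_PER_ELEM  # K + V

# Hypothetical MLA layer: cache one compressed KV latent plus a small
# decoupled RoPE key instead of per-head K/V.
kv_latent_dim, rope_dim = 512, 64
mla_bytes = (kv_latent_dim + rope_dim) * BYTES_PER_ELEM

print(f"MHA: {mha_bytes} bytes/token/layer")
print(f"MLA: {mla_bytes} bytes/token/layer")
print(f"MLA cache is ~{mha_bytes / mla_bytes:.1f}x smaller in this toy setup")
```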
Training Setup
- Cluster: 4 nodes x 8x H200 (32 GPUs total)
- Training: 2 nodes (16 GPUs), FSDP
- Inference: 2 nodes (16 GPUs), inference engine with TP=8 per node
- Duration: ~14 hours per training phase
Dataset: Regenerated open-perfectblend dataset
All training responses were regenerated by Kimi-K2.5 via the inference engine so that the training targets match the base model's own output token distribution.
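A minimal sketch of this regeneration step, assuming an OpenAI-compatible endpoint serving Kimi-K2.5. The endpoint URL, dataset field names, and sampling parameters are illustrative, not the exact TorchSpec pipeline.

```python
# Sketch: regenerate responses for draft-model training data by sampling
# them from the target model itself (hypothetical endpoint and field names).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def regenerate(example: dict) -> dict:
    """Replace the original response with one sampled from Kimi-K2.5,
    so the draft model is trained on the target model's own distribution."""
    resp = client.chat.completions.create(
        model="moonshotai/Kimi-K2.5",
        messages=[{"role": "user", "content": example["prompt"]}],
        temperature=0.0,  # illustrative; the actual sampling settings may differ
        max_tokens=2048,
    )
    return {"prompt": example["prompt"],
            "response": resp.choices[0].message.content}
```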
Performance
The primary metric is accept_length: the average number of tokens accepted per speculation step, measured with topk=1, num_steps=3, and num_draft_tokens=4. Higher is better.
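For concreteness, accept_length is simply the mean of the per-step acceptance counts; the trace below is made up for illustration (with num_draft_tokens=4, each step can accept at most 4 tokens).

```python
# accept_length = average number of tokens accepted per speculation step.
# The per-step counts below are made up for illustration.
accepted_per_step = [3, 4, 2, 4, 3, 1, 4, 3]

accept_length = sum(accepted_per_step) / len(accepted_per_step)
print(f"accept_length = {accept_length:.3f}")  # 3.000 for this toy trace

# Ignoring draft-model overhead, accept_length is a rough upper bound on
# the per-request decoding speedup from speculative decoding.
```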
Benchmarks were run using lm_eval.
| Category | Benchmark | Samples (N) | Accept Length |
|---|---|---|---|
| Dialogue | MTBench | 80 | 2.940 |
| Chinese | CEval | 212 | 2.829 |
| Math | GSM8K | 500 | 3.017 |
| Code | HumanEval | 164 | 2.969 |
| Math | MATH500 | 500 | 3.051 |
| Math | AIME | 30 | 3.139 |
| VL | MMStar | 200 | 2.597 |
Quick Start
Requirements
- NVIDIA GPU with CUDA 12.0+
- vLLM >= 0.18.0, or a nightly wheel/Docker image
Launch Server (vLLM)
```bash
vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --speculative-config '{"model": "lightseekorg/kimi-k2.5-eagle3-mla", "method": "eagle3", "num_speculative_tokens": 3}' \
    --trust-remote-code
```
For deployment configuration, refer to the official vLLM recipes.
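Once the server is up, you can sanity-check it with any OpenAI-compatible client; speculative decoding is transparent to the caller. The sketch below assumes vLLM's default port 8000 and the served model name from the launch command above.

```python
# Quick sanity check against the vLLM OpenAI-compatible endpoint
# (assumes the default port 8000 from the launch command above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "Briefly explain speculative decoding."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```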
Launch Server (SGLang)
The MLA Eagle3 draft model is not yet supported in SGLang; this section will be updated once support is available.
Run Benchmarks
```bash
lm_eval \
    --model local-completions \
    --model_args model=moonshotai/Kimi-K2.5,base_url=<url> \
    --tasks gsm8k \
    --batch_size 16
```
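To verify that speculation is actually being accepted during a benchmark run, one option is to scrape the server's Prometheus metrics endpoint and look for the speculative-decoding counters. The exact metric names vary across vLLM versions, so the sketch below just filters by substring rather than assuming specific names.

```python
# Scrape vLLM's Prometheus metrics and print any speculative-decoding
# counters (metric names differ across vLLM versions, so filter loosely).
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as r:
    text = r.read().decode()

for line in text.splitlines():
    if "spec" in line and not line.startswith("#"):
        print(line)
```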