# kimi-k2.6-eagle3-mla

## Model Overview

kimi-k2.6-eagle3-mla is an Eagle3 MTP draft model with MLA (Multi-head Latent Attention) for accelerating inference of Kimi-K2.6. It was trained with TorchSpec, an online speculative decoding training framework that runs FSDP training and inference concurrently. If you find this draft model useful, please give our project TorchSpec a star on GitHub.
## Why an MLA (Multi-head Latent Attention) Draft Model

Compared with an MHA draft model, the MLA variant is a better fit for Kimi-K2.6 deployment:
- Uses less KV cache, which reduces serving memory pressure.
- Matches Kimi-K2.6's MLA architecture, so it fits more naturally into the inference engine's KV-cache handling under different serving scenarios such as PD-Disaggregation.
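The KV-cache saving is the key point here. The sketch below illustrates it with back-of-envelope arithmetic; all dimensions (`n_heads`, `head_dim`, `kv_lora_rank`, `rope_dim`) are hypothetical placeholders, not the actual Kimi-K2.6 draft config:

```python
# Rough per-token, per-layer KV-cache sizing (hypothetical dims, for
# illustration only). MHA caches a full K and V vector for every head;
# MLA caches one shared compressed latent plus a small decoupled RoPE key.
n_heads = 64          # assumed number of attention heads
head_dim = 128        # assumed per-head dimension
kv_lora_rank = 512    # assumed MLA compressed-KV latent dimension
rope_dim = 64         # assumed decoupled RoPE key dimension

mha_elems = 2 * n_heads * head_dim      # K and V for every head
mla_elems = kv_lora_rank + rope_dim     # one shared latent + RoPE key

print(f"MHA elements/token/layer: {mha_elems}")
print(f"MLA elements/token/layer: {mla_elems}")
print(f"reduction: {mha_elems / mla_elems:.1f}x")
```

With these illustrative numbers the MLA cache is over an order of magnitude smaller per token, which is why the draft model adds little memory pressure on top of the target model.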
## Training Setup

- Cluster: 3 nodes × 8× B200 (24 GPUs total)
- Training: 1 node (8 GPUs), FSDP
- Inference: 2 nodes (16 GPUs), vLLM (TP=8 per node)
- Continual training: initialized from the kimi-k2.5-eagle3-mla checkpoint
- Iterations: 9,279 steps
- Learning rate: 2e-5, cosine schedule
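The learning-rate schedule can be sketched as a standard cosine decay from the peak rate over the full run. This is a minimal illustration of the schedule shape; warmup and minimum-LR details for this run are not specified in the card, so they are assumptions here:

```python
import math

def cosine_lr(step, total_steps=9279, peak_lr=2e-5, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over total_steps.
    No warmup is modeled; min_lr=0 is an assumption."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0))      # peak LR at the start of training
print(cosine_lr(9279))   # fully decayed at the final step
```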
## Performance

The primary metric is accept_length: the average number of tokens accepted per speculation step with num_speculative_tokens=3. Higher is better.

Benchmarks were run on vLLM 0.20.0 with 8× B200 GPUs.
| Category | Benchmark | N | Accept Length |
|---|---|---|---|
| Dialogue | MTBench | 80 | 2.624 |
| Chinese | CEval | 212 | 2.494 |
| Math | GSM8K | 500 | 2.987 |
| Code | HumanEval | 164 | 3.241 |
| Math | MATH500 | 500 | 3.245 |
| Math | AIME | 30 | 2.982 |
| Code | LiveCodeBench | 200 | 2.706 |
| Code | SPEED-Bench (coding) | 80 | 3.006 |
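To get an intuition for what these numbers mean for throughput, a rough estimate can be derived from accept_length. The sketch below assumes accept_length tokens are emitted per target-model forward pass and that drafting costs a small fraction of a target pass; the `draft_overhead` value is an illustrative assumption, not a measurement:

```python
# Back-of-envelope speedup estimate from accept_length (an illustration,
# not a benchmarked result). Baseline decoding emits 1 token per target
# forward pass; speculation emits accept_length tokens per pass but pays
# an assumed drafting cost of `draft_overhead` target-pass-equivalents.
def est_speedup(accept_length, draft_overhead=0.1):
    return accept_length / (1.0 + draft_overhead)

for name, al in [("GSM8K", 2.987), ("HumanEval", 3.241)]:
    print(f"{name}: ~{est_speedup(al):.2f}x estimated speedup")
```

Actual wall-clock speedup depends on batch size, sequence length, and the serving engine's scheduling, so treat this as an upper-bound intuition only.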
## Quick Start

### Requirements

- NVIDIA GPU with CUDA 12.0+
- vLLM >= 0.20.0
### Launch Server (vLLM)

```shell
vllm serve moonshotai/Kimi-K2.6 \
  --tensor-parallel-size 8 \
  --speculative-config '{"model": "lightseekorg/kimi-k2.6-eagle3-mla", "method": "eagle3", "num_speculative_tokens": 3}' \
  --trust-remote-code
```
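Once the server is up, speculative decoding is transparent to clients: requests go to vLLM's OpenAI-compatible endpoint exactly as they would without a draft model. A minimal sketch, assuming the default port 8000 on localhost:

```python
# Build a chat request for vLLM's OpenAI-compatible endpoint. The draft
# model is server-side only, so the client payload is unchanged.
import json
import urllib.request

payload = {
    "model": "moonshotai/Kimi-K2.6",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server from the command above is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```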
### Launch Server (SGLang)

MLA Eagle3 draft models are not yet supported in SGLang; this section will be updated once support is available.
## Citation

```bibtex
@misc{torchspec2026,
  title={TorchSpec: An Online Speculative Decoding Training Framework},
  url={https://github.com/torchspec-project/TorchSpec},
  year={2026}
}
```