kimi-k2.5-eagle3-mla
Model Overview
kimi-k2.5-eagle3-mla is an Eagle3 MTP draft model with MLA (Multi-head Latent Attention) for accelerating inference of Kimi-K2.5. It was trained with TorchSpec, an online speculative decoding training framework that runs FSDP training and inference concurrently. If you find this draft model useful, please give our project TorchSpec a star on GitHub.
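If you want the draft weights locally (for inspection or offline deployment), a standard huggingface_hub download is enough. This is a minimal sketch; it assumes nothing beyond the public repo id shown on this card.

```python
# Minimal sketch: download the draft-model weights from the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="lightseekorg/kimi-k2.5-eagle3-mla")
print(f"Draft model downloaded to: {local_dir}")
```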
Why an MLA (Multi-head Latent Attention) Draft Model?
Compared with an MHA draft model, the MLA variant is a better fit for Kimi-K2.5 deployment:
- Uses less KV cache, which reduces serving memory pressure (see the rough size comparison after this list).
- Matches Kimi-K2.5's MLA architecture, so it integrates more naturally with the inference engine's KV-cache handling across serving scenarios such as PD (prefill-decode) disaggregation.
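As a rough illustration of the KV-cache savings, the sketch below compares per-token cache size for a standard MHA layer against an MLA layer that caches a single compressed latent plus a small decoupled RoPE key. All dimensions are hypothetical placeholders, not Kimi-K2.5's actual configuration.

```python
# Rough per-token KV-cache size, MHA vs. MLA (hypothetical dimensions,
# NOT Kimi-K2.5's actual config).
BYTES_PER_ELEM = 2  # fp16/bf16

# Hypothetical MHA layer: cache full K and V for every head.
num_heads, head_dim = 64, 128
mha_bytes = 2 * num_heads * head_dim * BYTES_PER_ELEM  # K + V

# Hypothetical MLA layer: cache one compressed KV latent plus a small
# decoupled RoPE key instead of per-head K/V.
kv_latent_dim, rope_dim = 512, 64
mla_bytes = (kv_latent_dim + rope_dim) * BYTES_PER_ELEM

print(f"MHA: {mha_bytes} bytes/token/layer")
print(f"MLA: {mla_bytes} bytes/token/layer")
print(f"MLA cache is ~{mha_bytes / mla_bytes:.1f}x smaller in this toy setup")
```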
Training Setup
- Cluster: 4 nodes x 8x H200 (32 GPUs total)
- Training: 2 nodes (16 GPUs), FSDP
- Inference: 2 nodes (16 GPUs), inference engine with TP=8 per node
- Duration: ~14 hours per training phase
Dataset: Regenerated open-perfectblend dataset
All training responses were regenerated by Kimi-K2.5 via the inference engine so that the training targets match the base model's own output token distribution.
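A minimal sketch of this regeneration step, assuming an OpenAI-compatible endpoint serving Kimi-K2.5. The endpoint URL, dataset field names, and sampling parameters are illustrative, not the exact TorchSpec pipeline.

```python
# Sketch: regenerate responses for draft-model training data by sampling
# them from the target model itself (hypothetical endpoint and field names).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def regenerate(example: dict) -> dict:
    """Replace the original response with one sampled from Kimi-K2.5,
    so the draft model is trained on the target model's own distribution."""
    resp = client.chat.completions.create(
        model="moonshotai/Kimi-K2.5",
        messages=[{"role": "user", "content": example["prompt"]}],
        temperature=0.0,  # illustrative; the actual sampling settings may differ
        max_tokens=2048,
    )
    return {"prompt": example["prompt"],
            "response": resp.choices[0].message.content}
```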
Performance
The primary metric is accept_length: the average number of tokens accepted per speculation step, measured with topk=1, num_steps=3, and num_draft_tokens=4. Higher is better.
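For concreteness, accept_length is simply the mean of the per-step acceptance counts; the trace below is made up for illustration (with num_draft_tokens=4, each step can accept at most 4 tokens).

```python
# accept_length = average number of tokens accepted per speculation step.
# The per-step counts below are made up for illustration.
accepted_per_step = [3, 4, 2, 4, 3, 1, 4, 3]

accept_length = sum(accepted_per_step) / len(accepted_per_step)
print(f"accept_length = {accept_length:.3f}")  # 3.000 for this toy trace

# Ignoring draft-model overhead, accept_length is a rough upper bound on
# the per-request decoding speedup from speculative decoding.
```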
Benchmarks were run using lm_eval.
| Category | Benchmark | Samples (N) | Accept Length |
|---|---|---|---|
| Dialogue | MTBench | 80 | 2.940 |
| Chinese | CEval | 212 | 2.829 |
| Math | GSM8K | 500 | 3.017 |
| Code | HumanEval | 164 | 2.969 |
| Math | MATH500 | 500 | 3.051 |
| Math | AIME | 30 | 3.139 |
| VL | MMStar | 200 | 2.597 |
Quick Start
Requirements
- NVIDIA GPU with CUDA 12.0+
- vLLM >= 0.18.0, or a nightly wheel/Docker image
Launch Server (vLLM)
```bash
vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --speculative-config '{"model": "lightseekorg/kimi-k2.5-eagle3-mla", "method": "eagle3", "num_speculative_tokens": 3}' \
    --trust-remote-code
```
For deployment configuration, refer to the official vLLM recipes.
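Once the server is up, you can sanity-check it with any OpenAI-compatible client; speculative decoding is transparent to the caller. The sketch below assumes vLLM's default port 8000 and the served model name from the launch command above.

```python
# Quick sanity check against the vLLM OpenAI-compatible endpoint
# (assumes the default port 8000 from the launch command above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "Briefly explain speculative decoding."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```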
Launch Server (SGLang)
The MLA Eagle3 draft model is not yet supported in SGLang; this section will be updated once support is available.
Run Benchmarks
```bash
lm_eval \
    --model local-completions \
    --model_args model=moonshotai/Kimi-K2.5,base_url=<url> \
    --tasks gsm8k \
    --batch_size 16
```
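To verify that speculation is actually being accepted during a benchmark run, one option is to scrape the server's Prometheus metrics endpoint and look for the speculative-decoding counters. The exact metric names vary across vLLM versions, so the sketch below just filters by substring rather than assuming specific names.

```python
# Scrape vLLM's Prometheus metrics and print any speculative-decoding
# counters (metric names differ across vLLM versions, so filter loosely).
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as r:
    text = r.read().decode()

for line in text.splitlines():
    if "spec" in line and not line.startswith("#"):
        print(line)
```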