kimi-k2.5-eagle3-mla

Model Overview

kimi-k2.5-eagle3-mla is an Eagle3 MTP draft model with MLA (Multi-head Latent Attention) for accelerating inference of Kimi-K2.5. It was trained with TorchSpec, an online speculative decoding training framework that runs FSDP training and inference concurrently. If you find this draft model useful, please give our project TorchSpec a star on GitHub.

Why an MLA (Multi-head Latent Attention) Draft Model?

Compared with an MHA draft model, the MLA variant is a better fit for Kimi-K2.5 deployment:

  • Uses less KV cache, which reduces serving memory pressure.
  • Matches Kimi-K2.5's MLA architecture, so it fits more naturally into the inference engine's KV-cache handling under different serving scenarios such as PD-Disaggregation.
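To make the KV-cache saving concrete, here is a rough back-of-the-envelope sketch. The head counts, dimensions, and latent rank below are illustrative assumptions, not the actual Kimi-K2.5 configuration:

```python
# Rough per-token KV-cache footprint, MHA vs. MLA (illustrative numbers only).
def mha_kv_bytes(num_heads, head_dim, dtype_bytes=2):
    # MHA caches a full K and V vector for every attention head.
    return 2 * num_heads * head_dim * dtype_bytes

def mla_kv_bytes(kv_lora_rank, rope_head_dim, dtype_bytes=2):
    # MLA caches one compressed KV latent plus a small decoupled RoPE key.
    return (kv_lora_rank + rope_head_dim) * dtype_bytes

print(mha_kv_bytes(64, 128))   # 32768 bytes per token
print(mla_kv_bytes(512, 64))   # 1152 bytes per token
```

With these assumed sizes, the MLA layout caches roughly 28x fewer bytes per token, which is why an MLA draft adds far less serving memory pressure than an MHA one.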

Training Setup

  • Cluster: 4 nodes × 8 H200 GPUs (32 GPUs total)
  • Training: 2 nodes (16 GPUs), FSDP
  • Inference: 2 nodes (16 GPUs), Engine (TP=8 per node)
  • Duration: ~14 hours per phase

Dataset: Regenerated open-perfectblend dataset

All training responses were regenerated by Kimi-K2.5 via Engine to match the base model's exact token distribution.

Performance

The primary metric is accept_length - the average number of tokens accepted per speculation step with topk=1, num_steps=3, num_draft_tokens=4. Higher is better.
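As a sketch of how such a metric can be computed (the helper below is hypothetical, not TorchSpec's actual implementation): each speculation step accepts some prefix of the draft tokens, plus the one bonus token the target model always emits, and accept_length averages that count over steps.

```python
# Hypothetical helper illustrating the accept_length metric.
# accepted_per_step[i] = number of draft tokens accepted at speculation step i
# (0 .. num_draft_tokens-1); "+ 1" is the bonus token from the target model.
def accept_length(accepted_per_step):
    return sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)

print(accept_length([2, 1, 3, 2]))  # 3.0
```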

Benchmarks were run using lm_eval.

Category   Benchmark   N     Accept Length
Dialogue MTBench 80 2.940
Chinese CEval 212 2.829
Math GSM8K 500 3.017
Code HumanEval 164 2.969
Math MATH500 500 3.051
Math AIME 30 3.139
VL MMStar 200 2.597

Quick Start

Requirements

Launch Server (vLLM)

vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --speculative-config '{"model": "lightseekorg/kimi-k2.5-eagle3-mla", "method": "eagle3", "num_speculative_tokens": 3}' \
    --trust-remote-code

For deployment configuration, refer to the official vLLM recipes.
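Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal sketch of building a chat completion request follows; the localhost URL and port are assumptions matching vLLM's defaults:

```python
import json

# Build a request body for vLLM's OpenAI-compatible chat endpoint.
# The endpoint URL below assumes vLLM's default host and port.
def build_request(prompt, model="moonshotai/Kimi-K2.5", max_tokens=256):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = json.dumps(build_request("Explain speculative decoding in one sentence."))
# POST this body to http://localhost:8000/v1/chat/completions
```

Speculative decoding is transparent to the client: requests look identical with or without the draft model, and acceptance happens inside the server.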

Launch Server (SGLang)

The MLA Eagle3 draft model is not yet supported in SGLang; this section will be updated once support is available.

Run Benchmarks

lm_eval \
  --model local-completions \
  --model_args base_url=<url> \
  --tasks gsm8k \
  --batch_size 16