Instructions to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509")

# Load model directly
from transformers import AutoTokenizer, LlamaForCausalLMEagle3

tokenizer = AutoTokenizer.from_pretrained("thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509")
model = LlamaForCausalLMEagle3.from_pretrained("thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509

SGLang

How to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with Docker Model Runner:
```
docker model run hf.co/thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509
```

EAGLE3-Apertus-8B-Instruct-2509

An Eagle3 draft model for speculative decoding with swiss-ai/Apertus-8B-Instruct-2509.

Model Description

This is a lightweight draft model trained to accelerate inference of Apertus-8B-Instruct through speculative decoding. Eagle3 uses a single-layer architecture that predicts future tokens by leveraging the target model's hidden states.

Property	Value
Architecture	`LlamaForCausalLMEagle3`
Hidden Size	4096
Intermediate Size	21504
Attention Heads	32
KV Heads	8
Layers	1
Vocab Size	131,072
Draft Vocab Size	32,000
Precision	bfloat16
Parameters	~513M

Training Details

Framework: SpecForge
Target Model: swiss-ai/Apertus-8B-Instruct-2509
Epochs: 10
Batch Size: 1 per GPU
Learning Rate: 1e-4
Max Sequence Length: 4096
Hardware: 64 GPUs (16 nodes × 4 GPUs)
Precision: bfloat16

Training Data

The model was trained on ~375k samples of regenerated conversation data. The dataset consists of prompts from:

The responses were regenerated using Apertus-8B-Instruct-2509 to ensure the draft model learns from the target model's own output distribution.

See: thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data

Usage

With vLLM

VLLM_USE_V1=1 vllm serve swiss-ai/Apertus-8B-Instruct-2509 \
    --speculative-config '{"model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509", "num_speculative_tokens": 3, "method": "eagle3"}'

Or in Python:

from vllm import LLM, SamplingParams

llm = LLM(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    speculative_config={
        "model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509",
        "num_speculative_tokens": 3,
        "method": "eagle3",
    },
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)

With SGLang

python -m sglang.launch_server \
    --model swiss-ai/Apertus-8B-Instruct-2509 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 32

Continue Training

To resume training from this checkpoint:

Clone SpecForge
Download the training dataset from thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data
Download this checkpoint and place it in a subdirectory of your output directory (e.g., outputs/apertus-8b-eagle3/epoch_9_step_55000/)
Run with --resume (it will automatically find the last checkpoint in --output-dir):

NUM_GPUS=4
TP_SIZE=1

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    scripts/train_eagle3.py \
    --target-model-path swiss-ai/Apertus-8B-Instruct-2509 \
    --draft-model-config /path/to/configs/apertus-8b-eagle3.json \
    --train-data-path /path/to/merged_train_regen.jsonl \
    --output-dir /path/to/outputs/apertus-8b-eagle3 \
    --num-epochs 15 \
    --batch-size 1 \
    --tp-size $TP_SIZE \
    --learning-rate 1e-4 \
    --max-length 4096 \
    --chat-template apertus \
    --cache-dir /path/to/cache \
    --target-model-backend sglang \
    --resume

The --resume flag uses get_last_checkpoint() to automatically find the most recent checkpoint in the output directory.

License

Apache 2.0

Citation

If you use this model, please cite Eagle3:

@article{li2025eagle3,
  title={Eagle 3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}

Acknowledgments

Trained on the Alps supercomputer at CSCS (Swiss National Supercomputing Centre).

Downloads last month: 4

Safetensors

Model size

0.5B params

Tensor type

I64

BF16

BOOL

Model tree for thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509

Base model

swiss-ai/Apertus-8B-2509

Finetuned

swiss-ai/Apertus-8B-Instruct-2509

Finetuned

(15)

this model

Paper for thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Paper • 2503.01840 • Published Mar 3, 2025 • 9