Instructions to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509")# Load model directly from transformers import AutoTokenizer, LlamaForCausalLMEagle3 tokenizer = AutoTokenizer.from_pretrained("thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509") model = LlamaForCausalLMEagle3.from_pretrained("thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509
- SGLang
How to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 with Docker Model Runner:
docker model run hf.co/thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509
EAGLE3-Apertus-8B-Instruct-2509
An Eagle3 draft model for speculative decoding with swiss-ai/Apertus-8B-Instruct-2509.
Model Description
This is a lightweight draft model trained to accelerate inference of Apertus-8B-Instruct through speculative decoding. Eagle3 uses a single-layer architecture that predicts future tokens by leveraging the target model's hidden states.
| Property | Value |
|---|---|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden Size | 4096 |
| Intermediate Size | 21504 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Layers | 1 |
| Vocab Size | 131,072 |
| Draft Vocab Size | 32,000 |
| Precision | bfloat16 |
| Parameters | ~513M |
Training Details
- Framework: SpecForge
- Target Model: swiss-ai/Apertus-8B-Instruct-2509
- Epochs: 10
- Batch Size: 1 per GPU
- Learning Rate: 1e-4
- Max Sequence Length: 4096
- Hardware: 64 GPUs (16 nodes × 4 GPUs)
- Precision: bfloat16
Training Data
The model was trained on ~375k samples of regenerated conversation data. The dataset consists of prompts from:
The responses were regenerated using Apertus-8B-Instruct-2509 to ensure the draft model learns from the target model's own output distribution.
See: thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data
Usage
With vLLM
VLLM_USE_V1=1 vllm serve swiss-ai/Apertus-8B-Instruct-2509 \
--speculative-config '{"model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509", "num_speculative_tokens": 3, "method": "eagle3"}'
Or in Python:
from vllm import LLM, SamplingParams
llm = LLM(
model="swiss-ai/Apertus-8B-Instruct-2509",
speculative_config={
"model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509",
"num_speculative_tokens": 3,
"method": "eagle3",
},
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)
With SGLang
python -m sglang.launch_server \
--model swiss-ai/Apertus-8B-Instruct-2509 \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 \
--speculative-num-steps 5 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 32
Continue Training
To resume training from this checkpoint:
- Clone SpecForge
- Download the training dataset from thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data
- Download this checkpoint and place it in a subdirectory of your output directory (e.g.,
outputs/apertus-8b-eagle3/epoch_9_step_55000/) - Run with
--resume(it will automatically find the last checkpoint in--output-dir):
NUM_GPUS=4
TP_SIZE=1
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
scripts/train_eagle3.py \
--target-model-path swiss-ai/Apertus-8B-Instruct-2509 \
--draft-model-config /path/to/configs/apertus-8b-eagle3.json \
--train-data-path /path/to/merged_train_regen.jsonl \
--output-dir /path/to/outputs/apertus-8b-eagle3 \
--num-epochs 15 \
--batch-size 1 \
--tp-size $TP_SIZE \
--learning-rate 1e-4 \
--max-length 4096 \
--chat-template apertus \
--cache-dir /path/to/cache \
--target-model-backend sglang \
--resume
The --resume flag uses get_last_checkpoint() to automatically find the most recent checkpoint in the output directory.
License
Apache 2.0
Citation
If you use this model, please cite Eagle3:
@article{li2025eagle3,
title={Eagle 3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
journal={arXiv preprint arXiv:2503.01840},
year={2025}
}
Acknowledgments
Trained on the Alps supercomputer at CSCS (Swiss National Supercomputing Centre).
- Downloads last month
- 4
Model tree for thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509
Base model
swiss-ai/Apertus-8B-2509