Eagle 2.5
[📂Homepage] [📂GitHub] [📜Tech Report] [🤗HF Demo]
Introduction
Eagle 2.5 is a family of frontier vision-language models (VLMs) designed for long-context multimodal learning. While most existing VLMs focus on short-context tasks, Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework for both. Eagle 2.5 supports up to 512 video frames and is trained jointly on image + video data.
We also introduce Eagle-Video-110K, a novel dataset with both story-level and clip-level annotations, specifically curated for long video understanding. The dataset contains over 110K annotated samples, including QA, localization, and summarization. The videos range from a few minutes to 3 hours, pushing the limits of long-form visual reasoning.
Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, offering a robust solution to the limitations of existing VLMs. Notably, Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B, despite having significantly fewer parameters.
🚀Strong Results Across The Board:
- SOTA on 6 out of 10 long video benchmarks
- Outperforms GPT-4o (0806) on 3/5 video tasks
- Outperforms Gemini 1.5 Pro on 4/6 video tasks
- Matches or outperforms Qwen2.5-VL-72B on multiple key datasets
- 72.4% on Video-MME with 512 input frames
- Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL.
🎯Key Innovations
- Information-First Sampling:
  - Image Area Preservation (IAP): Optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details (see the tiling sketch after this list).
  - Automatic Degrade Sampling (ADS): Dynamically balances visual and textual input, ensuring complete text retention while maximizing visual content within context-length constraints.
- Progressive Mixed Post-Training:
  - Gradually increases context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling.
- Diversity-Driven Data Recipe:
  - Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.
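A minimal sketch of the IAP idea, with an assumed tile size, tile budget, and scoring rule (the select_tile_grid helper below is illustrative, not the exact Eagle 2.5 tiling rule): among candidate tile grids, prefer the one that keeps the most of the original area while staying close to the original aspect ratio.

import math

def select_tile_grid(width, height, tile_size=512, max_tiles=12):
    # Enumerate candidate (cols, rows) grids within the tile budget and keep
    # the one that best preserves the original area and aspect ratio.
    orig_aspect = width / height
    orig_area = width * height
    best, best_score = (1, 1), -math.inf
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Fraction of the original pixels that survive resizing into this
            # grid, capped at 1.0 so upscaling is not rewarded.
            covered = min(1.0, (cols * rows * tile_size * tile_size) / orig_area)
            aspect_penalty = abs(math.log((cols / rows) / orig_aspect))
            score = covered - aspect_penalty
            if score > best_score:
                best, best_score = (cols, rows), score
    return best

print(select_tile_grid(1920, 1080))  # (4, 2) with the assumed 512px tiles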
⚡Efficiency & Framework Optimization
- GPU Memory Optimization:
  - Integrates Triton-based fused operators that replace PyTorch's MLP, RMSNorm, and RoPE implementations.
  - Reduces GPU memory with fused linear layers + cross-entropy loss (removing intermediate logit storage) and CPU offloading of hidden states (see the chunked-loss sketch after this list).
- Distributed Context Parallelism:
  - Adopts a two-layer communication group based on Ulysses and Ring/Context Parallelism, building on USP.
  - Implements ZigZag Llama3-style Context Parallelism with all-gather KV to reduce communication latency.
- Video Decoding Acceleration:
  - Optimizes sparse video frame sampling with rapid video metadata parsing, improving long-video decoding and reducing memory consumption.
- Inference Acceleration:
  - Supports vLLM deployment with reduced memory and accelerated inference.
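A rough sketch of the fused linear + cross-entropy idea (plain PyTorch chunking for illustration, not the Triton kernels used in Eagle 2.5): compute the output projection and loss one chunk of tokens at a time, so the full [sequence, vocab] logit tensor is never materialized at once.

import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, weight, labels, chunk_size=1024):
    # hidden: [seq, d_model], weight: [vocab, d_model], labels: [seq]
    # Only a [chunk_size, vocab] block of logits exists at any point in this
    # forward pass; real fused kernels also fuse/recompute the backward pass.
    total_loss = hidden.new_zeros(())
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        logits = h @ weight.t()
        total_loss = total_loss + F.cross_entropy(logits, y, reduction="sum")
    return total_loss / labels.numel()

# Toy example: 8K tokens, 32K-entry vocabulary.
hidden = torch.randn(8192, 256)
weight = torch.randn(32000, 256)
labels = torch.randint(0, 32000, (8192,))
print(chunked_linear_cross_entropy(hidden, weight, labels))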
Model Details
- Model Type: Long-context vision-language model
- Architecture:
  - Vision encoder: Siglip2-So400m-Patch16-512
  - Language model: Qwen2.5-7B-Instruct
  - Multimodal base architecture: LLaVA with tiling-based vision input
- Supported Inputs:
  - Long video sequences (up to 512 frames)
  - High-resolution images (up to 4K HD input size)
  - Multi-page documents
  - Long text
- Training Strategy:
  - Progressive mixed post-training, expanding from 32K to 128K context length
  - Information-first sampling for optimal visual and textual information retention (see the frame-budget sketch after this list)
- Training Data:
  - Open-source video and document datasets
  - Eagle-Video-110K (110K long videos with dual-level annotation)
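A hedged sketch of the frame-budget arithmetic behind information-first sampling (the 256-tokens-per-frame figure and the helper below are assumptions for illustration, not Eagle 2.5's actual sampler): the text prompt is kept in full, and the remaining context budget determines how many frames can be sampled before degradation kicks in.

def frames_within_budget(num_text_tokens, context_len=131072,
                         tokens_per_frame=256, max_frames=512):
    # Keep all text tokens; spend whatever budget is left on video frames.
    visual_budget = max(context_len - num_text_tokens, 0)
    return min(max_frames, visual_budget // tokens_per_frame)

print(frames_within_budget(4_000))    # 496 frames fit alongside the prompt
print(frames_within_budget(120_000))  # only 43 frames fit; sampling degrades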
Released Models
Model | Date | Download Link | Notes |
---|---|---|---|
Eagle2.5-8B | 2025.04.16 | HF link | Long video (512 frames), high-res support |
Video Benchmarks
Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
---|---|---|---|---|---|
MVBench (test) | - | - | 72.0 | 69.6 | 74.8 |
Perception Test (val) | - | - | - | 70.5 | 82.0 |
EgoSchema (fullset) | - | 72.2 | - | 65.0 | 72.2 |
MMB-Video | 1.63 | 1.30 | 1.68 | 1.79 | 1.94 |
MLVU (val) | - | - | 68.9 | 70.2 | 77.6 |
LVBench (val) | 66.7 | 64.0 | 60.0 | 56.0 | 66.4 |
Video-MME (w/o subtitle) | 71.9 | 75.0 | 64.2 | 65.1 | 72.4 |
Video-MME (w/ subtitle) | 77.2 | 81.3 | 66.9 | 71.6 | 75.7 |
CG-Bench (Clue) | 58.6 | 50.9 | - | 44.5 | 55.8 |
CG-Bench (Long) | 44.9 | 37.8 | - | 35.5 | 46.6 |
CG-Bench (mIoU) | 5.73 | 3.85 | - | 2.48 | 13.4 |
HourVideo (Dev) | - | 37.2 | - | - | 44.5 |
HourVideo (Test) | - | 37.4 | - | - | 41.8 |
Charade-STA (mIoU) | 35.7 | - | - | 43.6 | 65.9 |
HD-EPIC | - | 37.6 | - | - | 42.9 |
HRVideoBench | - | - | - | - | 68.5 |
EgoPlan (val) | - | - | - | - | 45.3 |
Embodied Benchmarks
Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
---|---|---|---|---|---|
OpenEQA | - | - | - | - | 63.5 |
ERQA | 47.0 | 41.8 | - | - | 38.3 |
EgoPlan (val) | - | - | - | - | 45.3 |
Image Benchmarks
Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
---|---|---|---|---|---|
DocVQA (test) | 92.8 | 93.1 | 93.0 | 95.7 | 94.1 |
ChartQA (test) | 85.7 | 87.2 | 84.8 | 87.3 | 87.5 |
InfoVQA (test) | 79.2 | 81.0 | 77.6 | 82.6 | 80.4 |
TextVQA (val) | 77.4 | 78.8 | 79.1 | 84.9 | 83.7 |
OCRBench (test) | 736 | 754 | 822 | 864 | 869 |
MMStar (test) | 64.7 | 59.1 | 62.8 | 63.9 | 66.2 |
RWQA (test) | 75.4 | 67.5 | 70.1 | 68.5 | 76.7 |
AI2D (test) | 84.6 | 79.1 | 84.5 | 83.9 | 84.5 |
MMMU (val) | 69.1 | 62.2 | 56.0 | 58.6 | 55.8 |
MMBench_V11 (test) | 83.1 | 74.6 | 83.2 | 82.6 | 81.7 |
MMVet (GPT-4-Turbo) | 69.1 | 64.0 | 62.8 | 67.1 | 62.9 |
HallBench (avg) | 55.0 | 45.6 | 50.1 | 52.9 | 54.7 |
MathVista (testmini) | 63.8 | 63.9 | 64.4 | 68.2 | 67.8 |
Avg Score | 74.9 | 71.7 | 73.1 | 75.6 | 75.6 |
All numbers are directly extracted from Table 2 and Table 3 of the Eagle 2.5 Tech Report.
Eagle 2.5-8B matches or surpasses the performance of much larger models on long-context video and image benchmarks.
Quick Start
pip install transformers==4.51.0
single image
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
stream generation
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel, AutoTokenizer
import torch
from transformers import TextIteratorStreamer
import threading
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(
**inputs,
streamer=streamer,
max_new_tokens=1024,
do_sample=True,
top_p=0.95,
temperature=0.8
)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for new_text in streamer:
print(new_text, end="", flush=True)
multiple images
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{
"type": "image",
"image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/[email protected]",
},
{"type": "text", "text": "Describe these two images."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
single video
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "../Eagle2-8B/space_woaudio.mp4",
},
{"type": "text", "text": "Describe this video."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
multiple videos
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "../Eagle2-8B/space_woaudio.mp4",
"nframes": 10,
},
{
"type": "video",
"video": "../Eagle2-8B/video_ocr.mp4",
"nframes": 10,
},
{"type": "text", "text": "Describe these two videos respectively."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
batch inference
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages1 = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
messages2 = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/[email protected]",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
) for messages in [messages1, messages2]]
image_inputs, video_inputs = processor.process_vision_info([messages1, messages2])
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Citation
If you use Eagle 2.5 in your research, please cite:
@article{chen2025eagle2.5,
title={Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models},
author={Chen, Guo and Li, Zhiqi and Wang, Shihao and Jiang, Jindong and Liu, Yicheng and Lu, Lidong and Huang, De-An and Byeon, Wonmin and Le, Matthieu and Ehrlich, Max and Lu, Tong and Wang, Limin and Catanzaro, Bryan and Kautz, Jan and Tao, Andrew and Yu, Zhiding and Liu, Guilin},
journal={arXiv:2504.15271},
year={2025}
}
Acknowledgements
- We thank the contributors and collaborators for their valuable discussions and support, including the NVIDIA infrastructure, legal, and research teams.
- InternVL: we built our codebase on top of InternVL. Thanks for the great open-source project.
- VLMEvalKit: we use VLMEvalKit for evaluation. Many thanks for the wonderful tools.
- Thanks to Cambrian, LLaVA-OneVision, and other great open-source work for their efforts in organizing open-source data.
License/Terms of Use
- The code is released under the Apache 2.0 license as found in the LICENSE file.
- The pretrained model weights are released under the NSCLv1 license, under which users may use them for academic or non-profit research purposes only.
- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
- Model License of Qwen2.5-7B-Instruct: Apache-2.0
- Model License of siglip2-so400m-patch16-512: Apache-2.0
- Models are improved using Qwen.
- Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.