LLaVA-UHD-v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

πŸš€ Github   |   πŸ“„ Arxiv

We release LLaVA-UHD-v3, a multimodal large language model (MLLM) built upon our proposed Progressive Visual Compression (PVC) for efficient native-resolution encoding. The model not only achieves performance comparable to advanced MLLMs such as Qwen2-VL across 15 diverse benchmarks, but also delivers a 1.9x reduction in time-to-first-token (TTFT). Moreover, LLaVA-UHD-v3 can be trained efficiently in academic settings, requiring approximately 300 hours on 32 A100 GPUs.

Quick Start

pip install "transformers>=4.51.0"

Our model is integrated with 🤗 Transformers; the example below shows how to chat with LLaVA-UHD-v3.

Using 🤗 Transformers to Chat

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# default: Load the model on the available device(s)
model = AutoModelForImageTextToText.from_pretrained(
    "Sishxo/LLaVA-UHD-v3", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = AutoModelForImageTextToText.from_pretrained(
#     "Sishxo/LLaVA-UHD-v3",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
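# Note: flash_attention_2 also requires the flash-attn package, typically installed with
# `pip install flash-attn --no-build-isolation`, and a GPU that supports it.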

processor = AutoProcessor.from_pretrained("Sishxo/LLaVA-UHD-v3")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
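
The comment above mentions multi-image and video scenarios. Below is a minimal multi-image sketch, assuming the processor accepts several "image" entries in a single message (the file paths are placeholders); every other step is identical to the single-image example.

# Hypothetical multi-image request: only the messages change.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/first/image.jpg"},
            {"type": "image", "image": "file:///path/to/second/image.jpg"},
            {"type": "text", "text": "What are the differences between these two images?"},
        ],
    }
]
# Then call processor.apply_chat_template, model.generate, and processor.batch_decode as above.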

Highlights

[Figure: the LLaVA-UHD-v3 framework.]

🧠 Progressive Visual Compression (PVC): LLaVA-UHD v3 introduces a novel visual encoding strategy for efficient naive-resolution processing in MLLMs, combining fine-grained tokenization with hierarchical compression.

πŸ’‘ Refined Patch Embedding (RPE): Flexibly scales patch sizes to produce detailed visual tokens while maintaining full compatibility with pretrained Vision Transformers, enabling richer visual representations.

πŸ’‘ Windowed Token Compression (WTC): Progressively merges local token representations within the vision encoder, reducing sequence length and computational cost without losing holistic visual context.

πŸ† Preserved Holistic Understanding: Unlike slice-based approaches, PVC maintains full-scene semantics, preventing fragmented interpretations common in other naive-resolution encoding methods.

⚑ Efficient and Competitive: Achieves strong performance on a wide range of vision-language benchmarks, rivaling state-of-the-art models like Qwen2-VL, while significantly lowering inference latency.

This repository provides examples, usage instructions, and best practices to help developers leverage LLaVA-UHD v3 for efficient, high-fidelity vision-language tasks.

Performance

LLaVA-UHD-v3

[Figures 1 and 2: LLaVA-UHD-v3 performance.]

ViT-UHD

[Figure: ViT-UHD performance.]

Citation

If you find LLaVA-UHD-v3 useful for your research and applications, please cite using this BibTeX:

@misc{sun2025llavauhdv3progressivevisual,
      title={LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs}, 
      author={Shichu Sun and Yichen Zhang and Haolin Song and Zonghao Guo and Chi Chen and Yidan Zhang and Yuan Yao and Zhiyuan Liu and Maosong Sun},
      year={2025},
      eprint={2511.21150},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21150}, 
}
@article{zhang2024llavauhdv2,
  title={LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer},
  author={Yipeng Zhang and Yifan Liu and Zonghao Guo and Yidan Zhang and Xuesong Yang and Chi Chen and Jun Song and Bo Zheng and Yuan Yao and Zhiyuan Liu and Tat-Seng Chua and Maosong Sun},
  journal={arXiv preprint arXiv:2412.13871},
  year={2024}
}
@inproceedings{guo2024llava-uhd,
  title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
  author={Guo, Zonghao and Xu, Ruyi and Yao, Yuan and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
  booktitle={ECCV},
  year={2024}
}