VisionSelector-code: [πŸ“‚ VisionSelector]

VisionSelector-model: [πŸ€— VisionSelector-Qwen2.5-VL-3B] Β· [πŸ€— VisionSelector-Qwen2.5-VL-7B] Β· [πŸ€— VisionSelector-LLaVA-OV-1.5-8B]
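
A quick-start loading sketch is below. It assumes the checkpoint follows the standard Qwen2.5-VL interface in πŸ€— Transformers (>= 4.49); that is our assumption rather than documented usage, and the VisionSelector components may additionally require `trust_remote_code=True`.

```python
# Hedged loading sketch: assumes the standard Qwen2.5-VL classes apply to
# this checkpoint. If the repo ships custom modeling code, add
# trust_remote_code=True to both from_pretrained calls.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "JulietChoo/VisionSelector-Qwen2.5-VL-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are published in BF16
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```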

Model Overview

We introduce VisionSelector, an end-to-end learnable framework that recasts visual token compression as an optimization-driven decision process. VisionSelector integrates seamlessly into existing MLLMs without modifying the backbone, enabling adaptive compression with superior efficiency.

Our key technical innovations, sketched in code after this list, include:

  • A Differentiable Top-K Selection Mechanism that preserves end-to-end gradient flow while remaining fully compatible with high-performance acceleration kernels such as FlashAttention.
  • A Curriculum Annealing Strategy with a composite loss, which bridges the gap between soft token selection during training and hard selection at inference.
  • A backbone-decoupled Learnable Importance Scorer (LIS) that lets a model trained at a single compression rate generalize robustly to arbitrary compression budgets at inference.
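
To make the mechanics concrete, here is a minimal, self-contained sketch of how these three pieces could fit together. The straight-through soft top-k, the MLP scorer, and the geometric temperature schedule are illustrative assumptions on our part (all names below are hypothetical), not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LearnableImportanceScorer(nn.Module):
    """Backbone-decoupled scorer: a small MLP mapping each visual token
    embedding to a scalar importance score (hypothetical layer sizes)."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> scores: (batch, num_tokens)
        return self.mlp(tokens).squeeze(-1)

def differentiable_topk_mask(scores: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    """Differentiable top-k via a straight-through estimator.

    Forward pass: an exact 0/1 mask over the top-k scores, so the kept
    tokens form a dense subset that standard kernels (e.g. FlashAttention)
    can consume unchanged. Backward pass: gradients flow through a sigmoid
    relaxation centred on the k-th largest score.
    """
    kth = scores.topk(k, dim=-1).values[..., -1:]   # k-th largest score per sample
    soft = torch.sigmoid((scores - kth) / tau)      # relaxed membership in the top-k
    hard = (scores >= kth).float()                  # exact top-k mask used at inference
    return hard + (soft - soft.detach())            # straight-through: hard value, soft gradient

def annealed_tau(step: int, total_steps: int, tau_start: float = 1.0, tau_end: float = 0.01) -> float:
    """Curriculum annealing: decay the temperature so training moves from
    soft, exploratory selection toward the hard top-k used at inference."""
    t = min(step / max(total_steps, 1), 1.0)
    return tau_start * (tau_end / tau_start) ** t   # geometric decay

# Usage with hypothetical dimensions: retain 10% of 1024 visual tokens.
scorer = LearnableImportanceScorer(dim=1152)
tokens = torch.randn(2, 1024, 1152)
mask = differentiable_topk_mask(scorer(tokens), k=102, tau=annealed_tau(0, 10_000))
assert mask.shape == (2, 1024)
```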

VisionSelector is lightweight to train, adding only 12.85M trainable parameters. It delivers substantial performance-efficiency gains: a 12.14% performance improvement at a 10% token-retention budget, and a 1.73Γ— prefill speedup with an 86.08% memory reduction at 20% retention. It consistently outperforms state-of-the-art baselines across 13 image and video understanding benchmarks.

Institution

  • University of Science and Technology of China
  • ZTE-AIM



