VisionSelector-code: [πŸ“‚ VisionSelector]

VisionSelector-model: [πŸ€— VisionSelector-Qwen2.5-VL-3B] Β· [πŸ€— VisionSelector-Qwen2.5-VL-7B] Β· [πŸ€— VisionSelector-LLaVA-OV-1.5-8B]
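
A quick-start loading sketch is below. It assumes the checkpoint follows the standard Qwen2.5-VL interface in πŸ€— Transformers (>= 4.49); that is our assumption rather than documented usage, and the VisionSelector components may additionally require `trust_remote_code=True`.

```python
# Hedged loading sketch: assumes the standard Qwen2.5-VL classes apply to
# this checkpoint. If the repo ships custom modeling code, add
# trust_remote_code=True to both from_pretrained calls.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "JulietChoo/VisionSelector-Qwen2.5-VL-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are published in BF16
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```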

Model Overview

We introduce VisionSelector, an end-to-end learnable framework that recasts visual token compression as an optimization-driven decision process. VisionSelector integrates seamlessly into existing MLLMs without modifying the backbone, enabling adaptive compression with superior efficiency.

Our key technical innovations, sketched in code after this list, include:

  • A Differentiable Top-K Selection Mechanism that preserves end-to-end gradient flow while remaining fully compatible with high-performance acceleration kernels such as FlashAttention.
  • A Curriculum Annealing Strategy with a composite loss, which bridges the gap between soft token selection during training and hard selection at inference.
  • A backbone-decoupled Learnable Importance Scorer (LIS) that lets a model trained at a single compression rate generalize robustly to arbitrary compression budgets at inference.
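
To make the mechanics concrete, here is a minimal, self-contained sketch of how these three pieces could fit together. The straight-through soft top-k, the MLP scorer, and the geometric temperature schedule are illustrative assumptions on our part (all names below are hypothetical), not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LearnableImportanceScorer(nn.Module):
    """Backbone-decoupled scorer: a small MLP mapping each visual token
    embedding to a scalar importance score (hypothetical layer sizes)."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> scores: (batch, num_tokens)
        return self.mlp(tokens).squeeze(-1)

def differentiable_topk_mask(scores: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    """Differentiable top-k via a straight-through estimator.

    Forward pass: an exact 0/1 mask over the top-k scores, so the kept
    tokens form a dense subset that standard kernels (e.g. FlashAttention)
    can consume unchanged. Backward pass: gradients flow through a sigmoid
    relaxation centred on the k-th largest score.
    """
    kth = scores.topk(k, dim=-1).values[..., -1:]   # k-th largest score per sample
    soft = torch.sigmoid((scores - kth) / tau)      # relaxed membership in the top-k
    hard = (scores >= kth).float()                  # exact top-k mask used at inference
    return hard + (soft - soft.detach())            # straight-through: hard value, soft gradient

def annealed_tau(step: int, total_steps: int, tau_start: float = 1.0, tau_end: float = 0.01) -> float:
    """Curriculum annealing: decay the temperature so training moves from
    soft, exploratory selection toward the hard top-k used at inference."""
    t = min(step / max(total_steps, 1), 1.0)
    return tau_start * (tau_end / tau_start) ** t   # geometric decay

# Usage with hypothetical dimensions: retain 10% of 1024 visual tokens.
scorer = LearnableImportanceScorer(dim=1152)
tokens = torch.randn(2, 1024, 1152)
mask = differentiable_topk_mask(scorer(tokens), k=102, tau=annealed_tau(0, 10_000))
assert mask.shape == (2, 1024)
```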

VisionSelector is lightweight to train, adding only 12.85M trainable parameters. It delivers substantial performance-efficiency gains: a 12.14% performance improvement at a 10% token-retention budget, and a 1.73Γ— prefill speedup with an 86.08% memory reduction at 20% retention. It consistently outperforms state-of-the-art baselines across 13 image and video understanding benchmarks.

Institution

  • University of Science and Technology of China
  • ZTE-AIM



