VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Senqiao/VisionThink-General
This model is trained via reinforcement learning using the Senqiao/VisionThink-General-Train dataset, demonstrating enhanced performance on general VQA tasks.
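For reference, here is a minimal inference sketch. It assumes the checkpoint keeps the standard Qwen2.5-VL architecture of its base model and loads with the stock transformers classes and the qwen_vl_utils helper; the image path and prompt are placeholders, and the agentic resize-and-rerequest loop described in the paper is not shown.

```python
# Minimal usage sketch (assumption: standard Qwen2.5-VL loading works for this checkpoint).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Senqiao/VisionThink-General"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},   # placeholder image path
        {"type": "text", "text": "What does the sign in the image say?"},
    ],
}]

# Build the chat prompt and pack the image(s) into model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```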
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [Paper]
Senqiao Yang,
Junyi Li,
Xin Lai,
Bei Yu,
Hengshuang Zhao,
Jiaya Jia
Highlights
Our VisionThink leverages reinforcement learning to autonomously learn whether to reduce visual tokens. Compared to traditional efficient VLM approaches, our method achieves significant improvements on fine-grained benchmarks, such as those for OCR-related tasks.
VisionThink improves performance on General VQA tasks while reducing visual tokens by 50%, achieving 102% of the original model’s performance across nine benchmarks.
VisionThink achieves strong performance and efficiency by simply resizing input images to reduce visual tokens. We hope this inspires further research into Efficient Reasoning Vision Language Models.
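To make the token-count argument concrete, the sketch below illustrates (it is not the official VisionThink pipeline) how downscaling the input image shrinks the visual token budget for a Qwen2.5-VL-style encoder, where each visual token covers roughly a 28x28 pixel region after patch merging; the image path and scale factor are placeholders.

```python
# Illustrative only: estimate how resizing the input image changes the number of
# visual tokens for a Qwen2.5-VL-style encoder (one token per ~28x28 pixels after
# the 2x2 patch merge). Not the official VisionThink implementation.
from PIL import Image

TOKEN_PIXELS = 28  # approximate pixels per visual-token side for Qwen2.5-VL

def approx_visual_tokens(width: int, height: int) -> int:
    """Rough estimate: one token per TOKEN_PIXELS x TOKEN_PIXELS region."""
    return max(width // TOKEN_PIXELS, 1) * max(height // TOKEN_PIXELS, 1)

img = Image.open("example.jpg")    # placeholder image path
w, h = img.size
scale = 0.5 ** 0.5                 # shrink each side by 1/sqrt(2) -> about half the tokens
small = img.resize((int(w * scale), int(h * scale)))

print("original tokens ~", approx_visual_tokens(w, h))
print("resized tokens  ~", approx_visual_tokens(*small.size))
```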
Video
Citation
If you find this project useful in your research, please consider citing:
This work is strongly motivated by our previous effort on efficient VLMs, VisionZip, which explores visual token compression for faster inference.
@article{yang2025visionthink,
title={VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning},
author={Yang, Senqiao and Li, Junyi and Lai, Xin and Yu, Bei and Zhao, Hengshuang and Jia, Jiaya},
journal={arXiv preprint arXiv:2507.13348},
year={2025}
}
@article{yang2024visionzip,
title={VisionZip: Longer is Better but Not Necessary in Vision Language Models},
author={Yang, Senqiao and Chen, Yukang and Tian, Zhuotao and Wang, Chengyao and Li, Jingyao and Yu, Bei and Jia, Jiaya},
journal={arXiv preprint arXiv:2412.04467},
year={2024}
}
Model tree for Senqiao/VisionThink-General
Base model: Qwen/Qwen2.5-VL-7B-Instruct