VGR: Visual Grounded Reasoning
Overview
This is the home page for VGR (Visual Grounded Reasoning): a novel multimodal large language model (MLLM) designed to enhance fine-grained visual perception and reasoning capabilities. Unlike traditional MLLMs, VGR enables selective attention to visual regions during inference, improving accuracy in complex visual reasoning tasks. It introduces a self-driven selective visual replay mechanism and is trained on a large-scale dataset (VGR-SFT) that integrates visual grounding and language deduction.
Key Features
- Selective Visual Replay: Dynamically retrieves visual features from specific regions during reasoning (see the sketch after this list).
- Visual Grounding: Explicitly models visual region attention in multimodal reasoning.
- Efficient Token Usage: Uses only 30% of the image tokens of the baseline while improving performance.
- Superior Performance: Outperforms LLaVA-NeXT on benchmarks like MMStar (+4.1), AI2D (+7.1), and ChartQA (+12.9).
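The replay idea can be pictured as cropping a region's patch features out of the full visual feature map and re-injecting them into the reasoning context when the region is referenced. The sketch below illustrates that idea only; the function name, the 24x24 patch grid, and the normalized bounding-box format are assumptions and do not come from the released VGR code.

```python
# Minimal sketch of the selective-visual-replay idea: pick the patch features that
# fall inside a referenced region and splice them back into the reasoning context.
# Function names, the 24x24 grid, and the normalized-bbox format are illustrative
# assumptions; they do not reflect the released VGR implementation.
import torch


def replay_region_tokens(image_features: torch.Tensor,
                         bbox: tuple[float, float, float, float],
                         grid_size: int = 24) -> torch.Tensor:
    """Return the patch features whose centers lie inside a normalized bbox.

    image_features: (grid_size * grid_size, hidden_dim) flattened patch features.
    bbox: (x0, y0, x1, y1) with coordinates in [0, 1].
    """
    x0, y0, x1, y1 = bbox
    centers = (torch.arange(grid_size, dtype=torch.float32) + 0.5) / grid_size
    ys, xs = torch.meshgrid(centers, centers, indexing="ij")
    inside = (xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)
    return image_features[inside.flatten()]


# Example: replay the top-left quadrant of a 24x24 patch grid.
features = torch.randn(24 * 24, 1024)
region_tokens = replay_region_tokens(features, bbox=(0.0, 0.0, 0.5, 0.5))
print(region_tokens.shape)  # torch.Size([144, 1024])
# In the model, these region tokens would be appended to the language context
# at the point in the reasoning chain where the region is referenced.
```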
Dataset
VGR is trained on VGR-SFT, a large-scale dataset containing 158.1k samples across various domains:
- ScienceQA (AI2D: 12.5k)
- General VQA (GQA: 39.2k, LLaVA-COCO: 12.3k)
- OCR-based tasks (ChartQA: 11.2k, DocVQA: 6.0k, etc.)
The data has been made publicly available: check it out at VGR-SFT! A loading sketch follows below.
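If the data is hosted on the Hugging Face Hub, it can be loaded with the `datasets` library as sketched below. The repository identifier is a placeholder assumption; substitute the actual VGR-SFT dataset ID.

```python
# Minimal sketch of loading VGR-SFT with the Hugging Face `datasets` library.
# "ORG_NAME/VGR-SFT" is a placeholder assumption, not the actual repository ID.
from datasets import load_dataset

vgr_sft = load_dataset("ORG_NAME/VGR-SFT", split="train")
print(len(vgr_sft))   # expected to be on the order of 158.1k samples
print(vgr_sft[0])     # inspect one grounded-reasoning training sample
```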
Citation
@article{wang2025vgr,
title={VGR: Visual Grounded Reasoning},
author={Jiacong Wang and Zijian Kang and Haochen Wang and Haiyong Jiang and Jiawen Li and Bohong Wu and Ya Wang and Jiao Ran and Xiao Liang and Chao Feng and Jun Xiao},
journal={arXiv preprint arXiv:2506.11991},
year={2025}
}