Abstract
VGR, a novel multimodal large language model, improves visual reasoning by detecting relevant image regions and integrating them into the reasoning process, outperforming existing models on multimodal benchmarks while using only about 30% of the image tokens.
In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning in a pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in the language space, VGR first detects relevant regions that may help solve the problem, and then provides precise answers based on replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference, and a replay stage is introduced to integrate the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering improvements of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
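To make the grounding-and-replay idea concrete, below is a minimal, hypothetical sketch of such an inference loop. It assumes the model emits bounding boxes as inline tags (e.g. `<box>x1,y1,x2,y2</box>`) and that the referenced crop is re-encoded and appended to the context; the helper names (`build_prompt`, `append_region`, `generate`) are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of a VGR-style region-replay inference loop (not the authors' code).
# Assumed helpers: processor.build_prompt, processor.append_region, model.generate.
import re
from PIL import Image

BOX_PATTERN = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")

def region_replay_inference(model, processor, image: Image.Image, question: str,
                            max_rounds: int = 4) -> str:
    """Let the model ground regions during reasoning, then 'replay' their crops as extra visual tokens."""
    context = processor.build_prompt(image, question)   # assumed: text prompt + full-image tokens
    for _ in range(max_rounds):
        output = model.generate(context)                 # assumed: returns a partial reasoning string
        match = BOX_PATTERN.search(output)
        if match is None:
            return output                                # no further region requested: treat as final answer
        x1, y1, x2, y2 = map(int, match.groups())
        crop = image.crop((x1, y1, x2, y2))              # replay: crop and re-encode the referenced region
        context = processor.append_region(context, output, crop)  # assumed: extends context with region tokens
    return model.generate(context)
```

The design choice illustrated here is that region selection is part of generation itself: the model decides when to ground, and the replayed crop supplies fine-grained visual detail for the next reasoning step.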
Community
This is our exploratory work on multi-modal reasoning; a subset of the SFT data has been made available. We welcome discussion of this domain. If you have any questions, please feel free to engage with us.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning (2025)
- Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning (2025)
- RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought (2025)
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL (2025)
- Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward (2025)
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing (2025)
- Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models (2025)
Models citing this paper 1
Datasets citing this paper 1
Spaces citing this paper 0