MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Paper: MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
GitHub Repository: https://github.com/LengSicong/MMR1
Introduction
MMR1 addresses key limitations in large multimodal reasoning models, specifically the scarcity of high-quality, large-scale, long chain-of-thought (CoT) data and the instability of reinforcement learning (RL) algorithms during post-training. The standard Group Relative Policy Optimization (GRPO) framework often suffers from gradient vanishing when reward variance is low, weakening optimization signals and impairing convergence.
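For intuition, the snippet below shows the group-relative advantage used in GRPO-style updates: each rollout's reward is normalized by its group's mean and standard deviation, so a prompt whose rollouts all receive identical rewards produces all-zero advantages and contributes no gradient. This is an illustrative calculation, not code from the MMR1 repository.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative (GRPO-style) advantages: normalize each rollout's
    reward by the mean and standard deviation of its group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Mixed outcomes give informative, non-zero advantages.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# Identical outcomes (zero reward variance) give all-zero advantages,
# i.e. the prompt contributes no gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```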
This work makes three core contributions:
- Variance-Aware Sampling (VAS): A novel data selection strategy guided by a Variance Promotion Score (VPS). VAS combines outcome variance and trajectory diversity to promote reward variance, thereby stabilizing policy optimization and improving convergence in RL fine-tuning.
- Large-scale curated resources: Release of extensive, carefully curated datasets, including ~1.6M long CoT cold-start data and ~15k RL QA pairs. These resources are designed to ensure quality, difficulty, and diversity, and are accompanied by a fully reproducible end-to-end training codebase.
- Open-source models: A family of multimodal reasoning models released at multiple scales (MMR1-3B, MMR1-7B, and MMR1-32B), establishing standardized baselines for the community.
Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS, establishing new state-of-the-art performance for models of comparable scale.
Methodology Overview
The MMR1 framework introduces Variance-Aware Sampling (VAS) to tackle the gradient vanishing problem in reinforcement learning with Group Relative Policy Optimization (GRPO).
As illustrated in Figure 1, training begins with a pool of prompts: a random sampler ensures uniform coverage, while a weighted sampler, guided by the Variance Promotion Score (VPS), prioritizes prompts with higher reward variance and trajectory diversity. These two sources are combined to form training batches, balancing exploration and coverage. VPS scores are periodically re-estimated as the model improves, ensuring dynamic adaptation.
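As a concrete illustration of this batch-composition step, the sketch below mixes a VPS-weighted draw with a uniform draw over the prompt pool. The function and argument names (`sample_batch`, `vps_scores`, `vas_fraction`) are hypothetical stand-ins, not identifiers from the released codebase.

```python
import numpy as np

def sample_batch(prompt_ids, vps_scores, batch_size, vas_fraction=0.5, rng=None):
    """Compose a training batch from a VPS-weighted sampler and a uniform sampler.

    prompt_ids:   list of prompt identifiers
    vps_scores:   per-prompt Variance Promotion Scores (higher = more informative)
    vas_fraction: share of the batch drawn proportionally to VPS
    """
    rng = rng or np.random.default_rng()
    n_vas = int(round(batch_size * vas_fraction))
    n_rand = batch_size - n_vas

    # Weighted draw: prompts with higher VPS are selected more often.
    probs = np.asarray(vps_scores, dtype=np.float64) + 1e-12  # floor avoids an all-zero edge case
    probs /= probs.sum()
    vas_part = rng.choice(prompt_ids, size=n_vas, replace=False, p=probs)

    # Uniform draw over the remaining prompts keeps coverage of the full pool.
    remaining = [p for p in prompt_ids if p not in set(vas_part)]
    rand_part = rng.choice(remaining, size=n_rand, replace=False)

    return list(vas_part) + list(rand_part)
```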
Algorithm 1 details the step-by-step integration of VAS within the GRPO framework, showing how it adaptively steers training toward prompts with higher reward variance; this stabilizes optimization, amplifies gradient signals, and enables more efficient, robust learning.
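For intuition on how a per-prompt score could combine outcome variance with trajectory diversity, here is a minimal, self-contained stand-in for VPS. The exact definition and weighting follow the paper; the `alpha` trade-off and the entropy-based diversity proxy below are assumptions of this sketch, not the paper's formula.

```python
from collections import Counter
import numpy as np

def estimate_vps(rewards, trajectories, alpha=0.5):
    """Illustrative VPS for one prompt (not the paper's exact formula).

    rewards:      per-rollout scalar rewards (e.g., 0/1 correctness)
    trajectories: per-rollout reasoning strings sampled for the same prompt
    alpha:        weight trading off outcome variance vs. trajectory diversity
    """
    r = np.asarray(rewards, dtype=np.float64)
    outcome_variance = r.var()  # highest when successes and failures are mixed

    # Diversity proxy: normalized entropy over distinct trajectories.
    counts = np.array(list(Counter(trajectories).values()), dtype=np.float64)
    probs = counts / counts.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    diversity = entropy / np.log(len(trajectories)) if len(trajectories) > 1 else 0.0

    return alpha * outcome_variance + (1.0 - alpha) * diversity
```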
Open Resources
The project provides the following resources for the community:
- MMR1-SFT (~1.6M): Supervised fine-tuning dataset with ~1.6M long CoT cold-start trajectories (Gemini 2.5 Pro/Flash), with short answers verified by GPT-4o
- MMR1-RL (15k): RL dataset with 15k question-answer pairs (GPT-4o)
- MMR1-3B-SFT: 3B checkpoint trained with MMR1-SFT
- MMR1-3B-RL: 3B checkpoint trained with MMR1-SFT and MMR1-RL
- MMR1-7B-SFT: 7B checkpoint trained with MMR1-SFT
- MMR1-7B-RL: 7B checkpoint trained with MMR1-SFT and MMR1-RL
- MMR1-32B-SFT: 32B checkpoint trained with MMR1-SFT
- MMR1-32B-RL: 32B checkpoint trained with MMR1-SFT and MMR1-RL (On the way!)
The SFT dataset spans diverse domains, including mathematics, science, charts/figures, document tables, and general understanding, covering ~1.6M math samples and an additional ~37k samples across other domains. It integrates existing public resources (e.g., MathVerse, ScienceQA, ChartQA, DocVQA, GQA) with newly curated and self-collected data, ensuring quality, difficulty, and diversity.
Evaluation Results
MMR1 models were evaluated on a suite of mathematics-related multimodal reasoning benchmarks (MathVerse, MathVista, MathVision, LogicVista, and ChartQA).
- MMR1-7B-RL achieves an average score of 58.4, establishing new state-of-the-art performance among 7B-scale reasoning models.
- MMR1-3B-RL achieves a competitive average score of 52.7, showing strong reasoning ability even at a smaller scale. Across benchmarks, our models consistently match or outperform larger baselines, demonstrating the effectiveness of Variance-Aware Sampling (VAS) and our curated long CoT training data.
Analysis of VAS Training Dynamics
Further analysis highlights the effectiveness of Variance-Aware Sampling (VAS) through training efficiency and the evolution of Variance Promotion Score (VPS).
Training Efficiency (Fig. 2). VAS substantially amplifies gradient magnitudes, mitigating the gradient vanishing issue, and consistently provides stronger optimization signals. Higher clipping fractions suggest that policy updates are closer to the trust-region boundary, allowing more effective utilization of the learning signal. Both full VAS and mixed VAS–random sampling strategies converge faster and achieve higher final accuracy than the baseline, improving both efficiency and performance.
VPS Dynamics (Fig. 3). VPS distributions evolve from relatively uniform to more concentrated, indicating convergence in identifying consistently informative prompts. Dynamic reweighting ensures the model continually prioritizes prompts with higher reward variance while adapting as learning progresses, preventing overfitting to a static subset of data.
👉 Together, these analyses highlight how VAS effectively mitigates gradient vanishing, improves sample efficiency, and adapts dynamically to the evolving training landscape.
Qualitative Demo
To illustrate the reasoning capability of our models, we provide qualitative examples from MathVerse.
The demo showcases how the model carefully analyzes the problem, plans a structured solution, executes step-by-step reasoning, verifies results, and even provides alternative solution paths. This demonstrates the model’s ability to maintain logical consistency, perform reflective verification, and present human-readable reasoning traces.
Citation
If you find MMR1 useful for your research and applications, please cite using this BibTeX:
@misc{leng2025mmr1,
title={MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources},
author={Sicong Leng and Jing Wang and Jiaxi Li and Hao Zhang and Zhiqiang Hu and Boqiang Zhang and Yuming Jiang and Hang Zhang and Xin Li and Lidong Bing and Deli Zhao and Wei Lu and Yu Rong and Aixin Sun and Shijian Lu},
year={2025},
eprint={2509.21268},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.21268},
}
License
This project is released under the Apache 2.0 license as found in the LICENSE file.