MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Paper: MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
GitHub Repository: https://github.com/LengSicong/MMR1
Introduction
MMR1 addresses key limitations in large multimodal reasoning models, specifically the scarcity of high-quality, large-scale, long chain-of-thought (CoT) data and the instability of reinforcement learning (RL) algorithms during post-training. The standard Group Relative Policy Optimization (GRPO) framework often suffers from gradient vanishing when reward variance is low, weakening optimization signals and impairing convergence.
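For intuition, the snippet below shows the group-relative advantage used in GRPO-style updates: each rollout's reward is normalized by its group's mean and standard deviation, so a prompt whose rollouts all receive identical rewards produces all-zero advantages and contributes no gradient. This is an illustrative calculation, not code from the MMR1 repository.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative (GRPO-style) advantages: normalize each rollout's
    reward by the mean and standard deviation of its group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Mixed outcomes give informative, non-zero advantages.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# Identical outcomes (zero reward variance) give all-zero advantages,
# i.e. the prompt contributes no gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```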
This work makes three core contributions:
- Variance-Aware Sampling (VAS): A novel data selection strategy guided by a Variance Promotion Score (VPS). VAS combines outcome variance and trajectory diversity to promote reward variance, thereby stabilizing policy optimization and improving convergence in RL fine-tuning.
- Large-scale curated resources: Release of extensive, carefully curated datasets, including ~1.6M long CoT cold-start data and ~15k RL QA pairs. These resources are designed to ensure quality, difficulty, and diversity, and are accompanied by a fully reproducible end-to-end training codebase.
- Open-source models: A family of multimodal reasoning models released at multiple scales (MMR1-3B, MMR1-7B, and MMR1-32B), establishing standardized baselines for the community.
Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS, establishing new state-of-the-art performance for models of comparable scale.
Methodology Overview
The MMR1 framework introduces Variance-Aware Sampling (VAS) to tackle the gradient vanishing problem in reinforcement learning with Group Relative Policy Optimization (GRPO).
As illustrated in Figure 1, training begins with a pool of prompts: a random sampler ensures uniform coverage, while a weighted sampler, guided by the Variance Promotion Score (VPS), prioritizes prompts with higher reward variance and trajectory diversity. These two sources are combined to form training batches, balancing exploration and coverage. VPS scores are periodically re-estimated as the model improves, ensuring dynamic adaptation.
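As a concrete illustration of this batch-composition step, the sketch below mixes a VPS-weighted draw with a uniform draw over the prompt pool. The function and argument names (`sample_batch`, `vps_scores`, `vas_fraction`) are hypothetical stand-ins, not identifiers from the released codebase.

```python
import numpy as np

def sample_batch(prompt_ids, vps_scores, batch_size, vas_fraction=0.5, rng=None):
    """Compose a training batch from a VPS-weighted sampler and a uniform sampler.

    prompt_ids:   list of prompt identifiers
    vps_scores:   per-prompt Variance Promotion Scores (higher = more informative)
    vas_fraction: share of the batch drawn proportionally to VPS
    """
    rng = rng or np.random.default_rng()
    n_vas = int(round(batch_size * vas_fraction))
    n_rand = batch_size - n_vas

    # Weighted draw: prompts with higher VPS are selected more often.
    probs = np.asarray(vps_scores, dtype=np.float64) + 1e-12  # floor avoids an all-zero edge case
    probs /= probs.sum()
    vas_part = rng.choice(prompt_ids, size=n_vas, replace=False, p=probs)

    # Uniform draw over the remaining prompts keeps coverage of the full pool.
    remaining = [p for p in prompt_ids if p not in set(vas_part)]
    rand_part = rng.choice(remaining, size=n_rand, replace=False)

    return list(vas_part) + list(rand_part)
```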
Algorithm 1 details the step-by-step integration of VAS within the GRPO framework, showing how it adaptively steers training toward prompts with higher reward variance; this stabilizes optimization, amplifies gradient signals, and enables more efficient, robust learning.
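For intuition on how a per-prompt score could combine outcome variance with trajectory diversity, here is a minimal, self-contained stand-in for VPS. The exact definition and weighting follow the paper; the `alpha` trade-off and the entropy-based diversity proxy below are assumptions of this sketch, not the paper's formula.

```python
from collections import Counter
import numpy as np

def estimate_vps(rewards, trajectories, alpha=0.5):
    """Illustrative VPS for one prompt (not the paper's exact formula).

    rewards:      per-rollout scalar rewards (e.g., 0/1 correctness)
    trajectories: per-rollout reasoning strings sampled for the same prompt
    alpha:        weight trading off outcome variance vs. trajectory diversity
    """
    r = np.asarray(rewards, dtype=np.float64)
    outcome_variance = r.var()  # highest when successes and failures are mixed

    # Diversity proxy: normalized entropy over distinct trajectories.
    counts = np.array(list(Counter(trajectories).values()), dtype=np.float64)
    probs = counts / counts.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    diversity = entropy / np.log(len(trajectories)) if len(trajectories) > 1 else 0.0

    return alpha * outcome_variance + (1.0 - alpha) * diversity
```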
Open Resources
The project provides the following resources for the community:
- MMR1-SFT (~1.6M): Supervised fine-tuning dataset with ~1.6M long CoT cold-start trajectories (Gemini 2.5 Pro/Flash), with short answers verified by GPT-4o
- MMR1-RL (15k): RL dataset with 15k question-answer pairs (GPT-4o)
- MMR1-3B-SFT: 3B checkpoint trained with MMR1-SFT
- MMR1-3B-RL: 3B checkpoint trained with MMR1-SFT and MMR1-RL
- MMR1-7B-SFT: 7B checkpoint trained with MMR1-SFT
- MMR1-7B-RL: 7B checkpoint trained with MMR1-SFT and MMR1-RL
- MMR1-32B-SFT: 32B checkpoint trained with MMR1-SFT
- MMR1-32B-RL: 32B checkpoint trained with MMR1-SFT and MMR1-RL (On the way!)
The SFT dataset spans diverse domains, including mathematics, science, charts/figures, document tables, and general understanding, covering ~1.6M math samples and an additional ~37k samples across other domains. It integrates existing public resources (e.g., MathVerse, ScienceQA, ChartQA, DocVQA, GQA) with newly curated and self-collected data, ensuring quality, difficulty, and diversity.
Evaluation Results
MMR1 models were evaluated on a suite of mathematics-related multimodal reasoning benchmarks (MathVerse, MathVista, MathVision, LogicVista, and ChartQA).
- MMR1-7B-RL achieves an average score of 58.4, establishing new state-of-the-art performance among 7B-scale reasoning models.
- MMR1-3B-RL achieves a competitive average score of 52.7, showing strong reasoning ability even at a smaller scale. Across benchmarks, our models consistently match or outperform larger baselines, demonstrating the effectiveness of Variance-Aware Sampling (VAS) and our curated long CoT training data.
Analysis of VAS Training Dynamics
Further analysis highlights the effectiveness of Variance-Aware Sampling (VAS) through training efficiency and the evolution of Variance Promotion Score (VPS).
Training Efficiency (Fig. 2). VAS substantially amplifies gradient magnitudes, mitigating the gradient vanishing issue, and consistently provides stronger optimization signals. Higher clipping fractions suggest that policy updates are closer to the trust-region boundary, allowing more effective utilization of the learning signal. Both full VAS and mixed VAS–random sampling strategies converge faster and achieve higher final accuracy than the baseline, improving both efficiency and performance.
VPS Dynamics (Fig. 3). VPS distributions evolve from relatively uniform to more concentrated, indicating convergence in identifying consistently informative prompts. Dynamic reweighting ensures the model continually prioritizes prompts with higher reward variance while adapting as learning progresses, preventing overfitting to a static subset of data.
👉 Together, these analyses highlight how VAS effectively mitigates gradient vanishing, improves sample efficiency, and adapts dynamically to the evolving training landscape.
Qualitative Demo
To illustrate the reasoning capability of our models, we provide qualitative examples from MathVerse.
The demo showcases how the model carefully analyzes the problem, plans a structured solution, executes step-by-step reasoning, verifies results, and even provides alternative solution paths. This demonstrates the model’s ability to maintain logical consistency, perform reflective verification, and present human-readable reasoning traces.
Citation
If you find MMR1 useful for your research and applications, please cite using this BibTeX:
@misc{leng2025mmr1,
title={MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources},
author={Sicong Leng and Jing Wang and Jiaxi Li and Hao Zhang and Zhiqiang Hu and Boqiang Zhang and Yuming Jiang and Hang Zhang and Xin Li and Lidong Bing and Deli Zhao and Wei Lu and Yu Rong and Aixin Sun and Shijian Lu},
year={2025},
eprint={2509.21268},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.21268},
}
License
This project is released under the Apache 2.0 license as found in the LICENSE file.