Sicong and nielsr (HF Staff) committed on
Commit 568e8e5 · verified · 1 Parent(s): a3b7983

Add or improve model card for MMR1: add metadata, GitHub link, and comprehensive description (#1)

- Add or improve model card for MMR1: add metadata, GitHub link, and comprehensive description (b9a0862d12e1b34134d28d7e7377dfd641da31e8)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1): README.md (+123 −1)
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

[![arXiv](https://img.shields.io/badge/arXiv-2509.21268-b31b1b.svg)](https://arxiv.org/abs/2509.21268)
[![Hugging Face](https://img.shields.io/badge/HuggingFace-MMR1-FFAE1A)](https://huggingface.co/papers/2509.21268)

<p align="center">
<img src="https://github.com/LengSicong/MMR1/blob/main/assets/logo.png?raw=true" width="150" style="margin-bottom: 0.2;"/>
</p>

# MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

**Paper:** [MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources](https://huggingface.co/papers/2509.21268)

**GitHub Repository:** [https://github.com/LengSicong/MMR1](https://github.com/LengSicong/MMR1)

## Introduction

MMR1 addresses two key limitations of large multimodal reasoning models: the scarcity of high-quality, large-scale, long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms during post-training. The standard Group Relative Policy Optimization (GRPO) framework often suffers from vanishing gradients when reward variance is low, which weakens optimization signals and impairs convergence.
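
The vanishing-gradient issue can be seen in a toy sketch (illustrative only; `group_relative_advantages` is a hypothetical helper, not the released code): GRPO standardizes rewards within each rollout group, so a prompt whose rollouts all receive the same reward yields all-zero advantages and contributes nothing to the gradient.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# A prompt the model always gets right (or always wrong) has zero reward
# variance, so every advantage -- and the resulting gradient -- is zero.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0.0, 0.0, 0.0, 0.0]

# A mixed-outcome prompt yields non-zero advantages and a useful signal.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```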

This work makes three core contributions:

1. **Variance-Aware Sampling (VAS):** A novel data selection strategy guided by a Variance Promotion Score (VPS). VAS combines outcome variance and trajectory diversity to promote reward variance, thereby stabilizing policy optimization and improving convergence during RL fine-tuning.
2. **Large-scale curated resources:** Release of carefully curated datasets, including ~1.6M long-CoT cold-start examples and ~15k RL QA pairs, designed for quality, difficulty, and diversity, together with a fully reproducible end-to-end training codebase.
3. **Open-source models:** A family of multimodal reasoning models released at multiple scales (MMR1-3B, MMR1-7B, and MMR1-32B), establishing standardized baselines for the community.

Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS, establishing new state-of-the-art performance among models of comparable scale.

## Methodology Overview

The MMR1 framework introduces **Variance-Aware Sampling (VAS)** to tackle the *gradient vanishing problem* in reinforcement learning with Group Relative Policy Optimization (GRPO).

<p align="center">
<img src="https://github.com/LengSicong/MMR1/blob/main/assets/fig1.png?raw=true" alt="Overview of the VAS framework" width="700"/>
</p>

As illustrated in **Figure 1**, training begins with a pool of prompts: a random sampler ensures uniform coverage, while a weighted sampler, guided by the Variance Promotion Score (VPS), prioritizes prompts with higher reward variance and trajectory diversity. The two sources are combined to form training batches that balance exploration and coverage. VPS scores are periodically re-estimated as the model improves, so sampling adapts dynamically.
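
In simplified form, this batch construction might look as follows; the 50/50 mixing ratio, the `build_batch` name, and the VPS values are illustrative assumptions, not the released implementation.

```python
import random

def build_batch(prompts, vps, batch_size, random_frac=0.5, seed=0):
    """Mix a uniform sampler (coverage) with a VPS-weighted sampler (exploration)."""
    rng = random.Random(seed)
    n_random = int(batch_size * random_frac)
    n_weighted = batch_size - n_random
    uniform_part = rng.sample(prompts, n_random)  # uniform coverage
    # Weighted part: prompts with a higher VPS are drawn more often.
    weights = [vps[p] for p in prompts]
    weighted_part = rng.choices(prompts, weights=weights, k=n_weighted)
    return uniform_part + weighted_part

# Toy pool: every tenth prompt has a high (assumed) VPS.
prompts = [f"q{i}" for i in range(100)]
vps = {p: (1.0 if i % 10 == 0 else 0.05) for i, p in enumerate(prompts)}
batch = build_batch(prompts, vps, batch_size=16)
```

Periodically recomputing `vps` with fresh rollouts, as the paper describes, keeps the weighted half of each batch focused on prompts that are currently informative.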

<p align="center">
<img src="https://github.com/LengSicong/MMR1/blob/main/assets/algo1.png?raw=true" alt="algo" width="700"/>
</p>

**Algorithm 1** details the step-by-step integration of VAS into the GRPO framework: by adaptively steering training toward prompts with higher reward variance, it stabilizes optimization, amplifies gradient signals, and enables more efficient and robust learning.
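
As a rough sketch of the scoring idea (not the paper's exact formula): outcome variance can be taken as the Bernoulli variance of a prompt's pass rate, trajectory diversity as the fraction of distinct rollout trajectories, with an assumed equal weighting `alpha` between the two.

```python
def variance_promotion_score(rollouts, alpha=0.5):
    """Toy VPS: alpha * outcome variance + (1 - alpha) * trajectory diversity.

    rollouts: list of (trajectory_text, is_correct) pairs for one prompt.
    """
    n = len(rollouts)
    pass_rate = sum(ok for _, ok in rollouts) / n
    outcome_var = pass_rate * (1 - pass_rate)            # maximal at a 50% pass rate
    diversity = len({traj for traj, _ in rollouts}) / n  # distinct-trajectory ratio
    return alpha * outcome_var + (1 - alpha) * diversity

# A prompt the model always solves the same way contributes little signal...
easy = [("steps A", True)] * 8
# ...while a half-solved prompt with varied reasoning is promoted.
hard = [(f"steps {i}", i % 2 == 0) for i in range(8)]
assert variance_promotion_score(easy) < variance_promotion_score(hard)
```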

## Open Resources

The project provides the following resources for the community:

- **[MMR1-SFT](https://huggingface.co/datasets/MMR1/MMR1-SFT) (~1.6M):** Supervised fine-tuning dataset with ~1.6M long-CoT cold-start trajectories (Gemini2.5 Pro/Flash), with short answers verified by GPT-4o
- **[MMR1-RL](https://huggingface.co/datasets/MMR1/MMR1-RL) (15k):** RL dataset with 15k question-answer pairs (GPT-4o)
- **[MMR1-3B-SFT](https://huggingface.co/MMR1/MMR1-3B-SFT):** 3B checkpoint trained with MMR1-SFT
- **[MMR1-3B-RL](https://huggingface.co/MMR1/MMR1-3B-RL):** 3B checkpoint trained with MMR1-SFT and MMR1-RL
- **[MMR1-7B-SFT](https://huggingface.co/MMR1/MMR1-7B-SFT):** 7B checkpoint trained with MMR1-SFT
- **[MMR1-7B-RL](https://huggingface.co/MMR1/MMR1-7B-RL):** 7B checkpoint trained with MMR1-SFT and MMR1-RL
- **[MMR1-32B-SFT](https://huggingface.co/MMR1/MMR1-32B-SFT):** 32B checkpoint trained with MMR1-SFT
- **[MMR1-32B-RL](https://huggingface.co/MMR1/MMR1-32B-RL):** 32B checkpoint trained with MMR1-SFT and MMR1-RL (on the way!)

<p align="center">
<img src="https://github.com/LengSicong/MMR1/blob/main/assets/data.png?raw=true" alt="data" width="700"/>
</p>

Together, these datasets span diverse domains, including mathematics, science, charts/figures, document tables, and general understanding, covering ~1.6M math samples and an additional ~37K samples from other domains. They integrate existing public resources (e.g., MathVerse, ScienceQA, ChartQA, DocVQA, GQA) with newly curated and self-collected data, ensuring quality, difficulty, and diversity.

## Evaluation Results

MMR1 models were evaluated on a suite of **mathematics-related multimodal reasoning benchmarks** (MathVerse, MathVista, MathVision, LogicVista, and ChartQA).

<p align="center">
<img src="https://github.com/LengSicong/MMR1/blob/main/assets/result.png?raw=true" alt="result" width="700"/>
</p>

- **MMR1-7B-RL** achieves an average score of **58.4**, establishing new state-of-the-art performance among 7B-scale reasoning models.
- **MMR1-3B-RL** performs competitively at **52.7**, showing strong reasoning ability even at a smaller scale.

Our models consistently outperform or match larger baselines, demonstrating the effectiveness of **Variance-Aware Sampling (VAS)** and our curated **long CoT training data**.

## Analysis of VAS Training Dynamics

Further analysis highlights the effectiveness of **Variance-Aware Sampling (VAS)** through training efficiency and the evolution of the **Variance Promotion Score (VPS)**.

<p align="center">
<img src="https://github.com/LengSicong/MMR1/blob/main/assets/anal1.png?raw=true" alt="anal1" width="700"/>
</p>

**Training Efficiency (Fig. 2).** VAS substantially amplifies gradient magnitudes, mitigating the gradient vanishing issue and providing consistently stronger optimization signals. Higher clipping fractions suggest that policy updates sit closer to the trust-region boundary, making more effective use of the learning signal. Both full VAS and mixed VAS–random sampling converge faster and reach higher final accuracy than the baseline, improving both efficiency and performance.

<p align="center">
<img src="https://github.com/LengSicong/MMR1/blob/main/assets/anal2.png?raw=true" alt="anal2" width="700"/>
</p>

**VPS Dynamics (Fig. 3).** VPS distributions evolve from relatively uniform to more concentrated, indicating convergence in identifying consistently informative prompts. Dynamic reweighting keeps the model prioritizing prompts with higher reward variance while adapting as learning progresses, preventing overfitting to a static subset of the data.

👉 Together, these analyses show how **VAS mitigates gradient vanishing, improves sample efficiency, and adapts dynamically to the evolving training landscape.**

## Qualitative Demo

To illustrate the reasoning capability of our models, we provide qualitative examples from **MathVerse**.

<p align="center">
<img src="https://github.com/LengSicong/MMR1/blob/main/assets/demo.png?raw=true" alt="demo" width="700"/>
</p>

The demo showcases how the model carefully analyzes the problem, plans a structured solution, executes step-by-step reasoning, verifies its results, and even provides alternative solution paths. This demonstrates the model’s ability to maintain logical consistency, perform reflective verification, and present human-readable reasoning traces.

## Citation

If you find MMR1 useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{leng2025mmr1,
      title={MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources},
      author={Sicong Leng and Jing Wang and Jiaxi Li and Hao Zhang and Zhiqiang Hu and Boqiang Zhang and Yuming Jiang and Hang Zhang and Xin Li and Lidong Bing and Deli Zhao and Wei Lu and Yu Rong and Aixin Sun and Shijian Lu},
      year={2025},
      eprint={2509.21268},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.21268},
}
```

## License

This project is released under the Apache 2.0 license, as found in the LICENSE file.