arxiv:2511.16334

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

Published on Nov 20 · Submitted by kcz on Nov 24
#1 Paper of the day

Abstract

AI-generated summary: OpenMMReasoner, a two-stage training approach combining supervised fine-tuning and reinforcement learning, enhances multimodal reasoning performance through rigorous data curation and improved training strategies.

Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-source all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
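
To make the two-stage recipe concrete, below is a minimal sketch of an SFT-then-RL loop built on Hugging Face TRL. This is only an illustration of the general pattern described in the abstract, not the authors' released pipeline: the dataset files, reward function, checkpoint names, and hyperparameters are placeholder assumptions, and a real multimodal run would additionally need image preprocessing for the vision-language model.

```python
# Illustrative two-stage SFT -> RL sketch (not the authors' exact training code).
# Assumes Hugging Face TRL; dataset files and the reward function are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

BASE_MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"  # baseline model reported in the paper

# --- Stage 1: supervised fine-tuning on a cold-start reasoning dataset ---
sft_data = load_dataset("json", data_files="sft_cold_start.jsonl", split="train")  # hypothetical file
sft_trainer = SFTTrainer(
    model=BASE_MODEL,
    train_dataset=sft_data,
    args=SFTConfig(output_dir="openmmreasoner-sft", num_train_epochs=1),
)
sft_trainer.train()
sft_trainer.save_model("openmmreasoner-sft")

# --- Stage 2: RL (GRPO-style) on a smaller, diverse dataset with a verifiable reward ---
def exact_match_reward(completions, answer, **kwargs):
    """Toy verifiable reward: 1.0 if the gold answer string appears in the completion."""
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

rl_data = load_dataset("json", data_files="rl_prompts.jsonl", split="train")  # hypothetical file
rl_trainer = GRPOTrainer(
    model="openmmreasoner-sft",          # start RL from the SFT checkpoint
    reward_funcs=exact_match_reward,
    train_dataset=rl_data,
    args=GRPOConfig(output_dir="openmmreasoner-rl", num_generations=8),
)
rl_trainer.train()
```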

Community

Paper author and submitter:

Models : https://huggingface.co/OpenMMReasoner/OpenMMReasoner-RL
Collection : https://huggingface.co/collections/lmms-lab/openmmreasoner
Paper : https://arxiv.org/abs/2511.16334
Project Page : https://evolvinglmms-lab.github.io/OpenMMReasoner/
GitHub : https://github.com/EvolvingLMMs-Lab/OpenMMReasoner
Blog : https://www.lmms-lab.com/posts/openmmreasoner/
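
For a quick try of the released checkpoint linked above, the sketch below loads it for inference. It assumes the weights stay compatible with the Qwen2.5-VL classes in a recent transformers release with multimodal chat-template support (the base model is Qwen2.5-VL-7B-Instruct); the image URL and prompt are placeholders.

```python
# Minimal inference sketch for the released RL checkpoint (assumes Qwen2.5-VL-compatible weights).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "OpenMMReasoner/OpenMMReasoner-RL"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

# Placeholder multimodal prompt: an image plus a reasoning instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/diagram.png"},  # placeholder image
        {"type": "text", "text": "Solve the problem in the image step by step."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens (skip the prompt portion).
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```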

Models citing this paper 2

Datasets citing this paper 2

Spaces citing this paper 0

Collections including this paper 5