MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Abstract
Variance-Aware Sampling and large-scale long CoT data improve multimodal reasoning models by stabilizing RL fine-tuning and boosting performance on mathematical reasoning benchmarks.
Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by a Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
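The abstract does not spell out the exact form of VPS, so the sketch below is only an illustration of the idea: for each prompt, estimate reward variance from a group of rollouts, add a diversity proxy over the sampled trajectories, and draw training prompts with probability proportional to the combined score. The function names, the cosine-distance diversity proxy, and the weighting `alpha` are assumptions for illustration, not the paper's definitions.

```python
import numpy as np


def variance_promotion_score(rewards, trajectory_embeddings, alpha=0.5):
    """Illustrative Variance Promotion Score (VPS) for a single prompt.

    Combines outcome-reward variance with a trajectory-diversity proxy over
    G rollouts. The exact combination used in MMR1 may differ.

    rewards: shape (G,), per-rollout outcome rewards (e.g., 0/1 correctness).
    trajectory_embeddings: shape (G, d), embeddings of the G sampled CoT rollouts (G > 1).
    alpha: hypothetical weight balancing the two terms.
    """
    rewards = np.asarray(rewards, dtype=float)
    outcome_variance = rewards.var()  # for 0/1 rewards this equals p * (1 - p)

    # Diversity proxy: mean pairwise cosine distance between rollout embeddings.
    x = np.asarray(trajectory_embeddings, dtype=float)
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sims = x @ x.T
    g = len(rewards)
    mean_offdiag_sim = (sims.sum() - g) / (g * (g - 1))  # exclude self-similarity
    diversity = 1.0 - mean_offdiag_sim

    return alpha * outcome_variance + (1.0 - alpha) * diversity


def variance_aware_sample(vps_scores, batch_size, rng=None):
    """Draw a training batch with probability proportional to VPS, so prompts
    whose rollouts show higher reward variance / diversity are sampled more often."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(vps_scores, dtype=float) + 1e-8  # avoid an all-zero distribution
    probs = scores / scores.sum()
    return rng.choice(len(scores), size=batch_size, replace=False, p=probs)
```

The intuition matches the abstract's theoretical claim: prompts whose rollout rewards have higher variance contribute a larger expected policy gradient under GRPO, so biasing the sampler toward them keeps the optimization signal from vanishing.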
Community
We release the following resources for the community: https://github.com/LengSicong/MMR1
MMR1-SFT (~1.6M): Supervised fine-tuning dataset with ~1.6M long CoT cold-start trajectories (generated by Gemini 2.5 Pro/Flash), with short answers verified by GPT-4o
MMR1-RL (~15k): RL dataset with ~15k question-answer pairs (GPT-4o)
MMR1-3B-SFT: 3B checkpoint trained with MMR1-SFT
MMR1-3B-RL: 3B checkpoint trained with MMR1-SFT and MMR1-RL
MMR1-7B-SFT: 7B checkpoint trained with MMR1-SFT
MMR1-7B-RL: 7B checkpoint trained with MMR1-SFT and MMR1-RL
MMR1-32B-SFT: 32B checkpoint trained with MMR1-SFT
MMR1-32B-RL: 32B checkpoint trained with MMR1-SFT and MMR1-RL (On the way!)
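As a quick start, the datasets and checkpoints can be fetched from the Hugging Face Hub once released. The repository IDs below are hypothetical placeholders; check the GitHub README for the official paths.

```python
from huggingface_hub import snapshot_download

# Hypothetical repository IDs; substitute the official ones from the MMR1 README.
sft_data_dir = snapshot_download(repo_id="MMR1/MMR1-SFT", repo_type="dataset")
rl_data_dir = snapshot_download(repo_id="MMR1/MMR1-RL", repo_type="dataset")
model_dir = snapshot_download(repo_id="MMR1/MMR1-7B-RL")  # model repos use the default repo_type

print(sft_data_dir, rl_data_dir, model_dir)
```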
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning (2025)
- Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle (2025)
- Can GRPO Boost Complex Multimodal Table Understanding? (2025)
- COPO: Consistency-Aware Policy Optimization (2025)
- VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models (2025)
- A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models (2025)
- Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR (2025)