DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training
Abstract
Training multimodal process reward models (PRMs) is challenged by distribution shifts and noisy data. We introduce DreamPRM-1.5, an instance-reweighted framework that adaptively adjusts the importance of each training example via bi-level optimization. We design two complementary weighting strategies: Instance Table, effective for smaller datasets, and Instance Net, which scales to larger ones. Integrated into test-time scaling, DreamPRM-1.5 achieves 84.6% accuracy on the MMMU benchmark, surpassing GPT-5.
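To make the abstract's mechanism concrete, below is a minimal PyTorch-style sketch of instance reweighting with a one-step bi-level (meta-gradient) approximation. Everything here is an assumption for illustration: the class and function names (InstanceTable, InstanceNet, bilevel_step, per_example_loss), the sigmoid weight parameterization, and the one-step inner approximation are not taken from the authors' released code.

```python
import torch
import torch.nn as nn
from torch.func import functional_call


class InstanceTable(nn.Module):
    """One learnable logit per training example (suits smaller datasets)."""
    def __init__(self, num_instances: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_instances))

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps each instance weight in (0, 1).
        return torch.sigmoid(self.logits[indices])


class InstanceNet(nn.Module):
    """Maps instance features to a weight (scales to larger datasets)."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)


def bilevel_step(prm, weighter, train_batch, meta_batch,
                 outer_opt, per_example_loss, inner_lr=1e-3):
    """One outer update of the instance weighter (one-step inner approx.)."""
    keys, inputs, targets = train_batch          # keys: indices or features
    meta_inputs, meta_targets = meta_batch
    params = dict(prm.named_parameters())

    # Lower level: reweighted PRM training loss, kept differentiable
    # with respect to the instance weights.
    weights = weighter(keys)
    losses = per_example_loss(functional_call(prm, params, (inputs,)), targets)
    inner_loss = (weights * losses).mean()

    # Virtual one-step SGD update of the PRM; create_graph=True lets the
    # meta-gradient flow back through this update into the weighter.
    grads = torch.autograd.grad(inner_loss, tuple(params.values()),
                                create_graph=True)
    updated = {name: p - inner_lr * g
               for (name, p), g in zip(params.items(), grads)}

    # Upper level: evaluate the virtually updated PRM on a clean meta set
    # and update only the weighter's parameters.
    meta_out = functional_call(prm, updated, (meta_inputs,))
    meta_loss = per_example_loss(meta_out, meta_targets).mean()
    outer_opt.zero_grad()
    meta_loss.backward()
    outer_opt.step()
    return meta_loss.item()
```

In a full training loop, the PRM itself would also take real optimizer steps on the reweighted loss (with weights detached); the sketch above shows only how the upper-level meta-loss on a clean held-out set drives the per-instance weights, which is the part that addresses distribution shift and noisy labels.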