## Model Summary
Qwen-VL-PRM-3B is a process reward model finetuned from Qwen2.5-VL-3B-Instruct on approximately 300,000 examples. Despite being trained mainly on abstract and elementary reasoning datasets, it yields strong test-time scaling gains on a range of advanced multimodal reasoning benchmarks when paired with Qwen2.5-VL and Gemma-3 models.
## Use

The model's usage is documented here.
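As a quick illustration, the sketch below scores a single reasoning step with the checkpoint. This is a minimal sketch, not the documented interface: the repo id, image path, prompt wording, and the Yes/No verdict format are all assumptions, and it presumes the checkpoint loads with the standard Qwen2.5-VL classes in Transformers.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen-VL-PRM-3B"  # placeholder: substitute the actual Hub repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical judging prompt: show the image, the problem, and one
# candidate reasoning step, then ask for a binary verdict.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": (
            "Problem: <question text>\n"
            "Step: <candidate reasoning step>\n"
            "Is this step correct? Answer Yes or No."
        )},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open("puzzle.png")  # placeholder image path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Turn the logits of the assumed verdict tokens into a step score.
yes_id = processor.tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = processor.tokenizer.encode("No", add_special_tokens=False)[0]
p_correct = torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)[0].item()
print(f"P(step correct) ~ {p_correct:.3f}")
```

Reading the verdict token's probability from a single forward pass, rather than generating text, keeps step scoring cheap when many candidates must be ranked.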
## Evaluation
### Commercial Models

| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| GPT-4o | 70.7 | 60.0 | 57.8 | 30.9 | 31.2 | 50.1 |
| o1 | 78.2 | 78.9 | 54.4 | 73.9 | 60.3 | 69.1 |
| o3 | 82.9 | 84.1 | 62.3 | 86.8 | -- | -- |
### Qwen-2.5-VL Family

| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Qwen-2.5-VL-3B | 51.7 | 34.5 | 25.7 | 60.0 | 21.2 | 38.6 |
| + VL-PRM-7B | 53.7 (+2.0) | 44.9 (+10.5) | 28.3 (+2.6) | 64.1 (+4.1) | 21.8 (+0.6) | 42.6 (+4.0) |
| Qwen-2.5-VL-7B | 55.0 | 48.0 | 29.1 | 67.8 | 24.2 | 44.8 |
| + VL-PRM-3B | 57.6 (+2.6) | 55.5 (+7.5) | 33.8 (+4.7) | 70.0 (+2.2) | 26.1 (+1.9) | 48.6 (+3.6) |
| + VL-PRM-7B | 57.4 (+2.4) | 54.8 (+6.8) | 35.3 (+6.2) | 71.0 (+3.2) | 26.2 (+2.0) | 48.9 (+4.1) |
| Qwen-2.5-VL-32B | 66.0 | 46.2 | 26.9 | 76.9 | 36.7 | 50.5 |
| + VL-PRM-3B | 67.0 (+1.0) | 67.1 (+20.8) | 41.6 (+14.7) | 77.7 (+0.8) | 40.5 (+3.8) | 58.7 (+8.2) |
| + VL-PRM-7B | 67.6 (+1.6) | 66.8 (+20.6) | 44.2 (+17.3) | 78.3 (+1.4) | 40.1 (+3.2) | 59.4 (+8.9) |
### Gemma-3 Family

| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Gemma-3-12B | 57.6 | 45.0 | 29.1 | 58.9 | 28.1 | 43.7 |
| + VL-PRM-3B | 60.4 (+2.8) | 57.7 (+12.7) | 39.7 (+10.6) | 60.3 (+1.4) | 33.8 (+5.7) | 50.4 (+6.7) |
| + VL-PRM-7B | 60.2 (+2.6) | 59.0 (+12.0) | 41.1 (+4.5) | 63.3 (+4.4) | 33.9 (+5.8) | 51.5 (+7.8) |
| Gemma-3-27B | 62.9 | 50.8 | 29.9 | 61.6 | 32.4 | 47.5 |
| + VL-PRM-3B | 65.5 (+2.6) | 67.4 (+16.6) | 40.3 (+10.4) | 65.4 (+3.8) | 39.8 (+7.4) | 55.7 (+8.2) |
| + VL-PRM-7B | 64.5 (+1.6) | 67.6 (+16.8) | 41.1 (+11.2) | 65.2 (+3.6) | 40.9 (+8.5) | 55.9 (+8.4) |
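The "+ VL-PRM" rows above reflect test-time scaling: the policy model samples several candidate solutions, the PRM scores each one step by step, and the best-scoring candidate is kept. The sketch below shows that selection loop in schematic form; `generate_candidates` and `score_step` are hypothetical stand-ins (for your policy model and for a PRM scoring routine such as the one sketched under Use), and min-aggregation is just one common choice.

```python
from typing import Callable

# Schematic best-of-N selection with a PRM. `generate_candidates` and
# `score_step` are hypothetical stand-ins, not functions shipped with this repo.
def best_of_n(
    question: str,
    image_path: str,
    generate_candidates: Callable[[str, str, int], list[list[str]]],
    score_step: Callable[[str, str, str], float],
    n: int = 8,
) -> list[str]:
    # Each candidate is one solution: a list of reasoning steps
    # sampled from the policy model.
    candidates = generate_candidates(question, image_path, n)

    def solution_score(steps: list[str]) -> float:
        # Score a solution by its weakest step; taking the product or the
        # mean of step scores are common alternatives.
        return min(score_step(question, image_path, step) for step in steps)

    return max(candidates, key=solution_score)
```

Min-aggregation rejects any solution containing a step the PRM judges unsound, which tends to be a stricter criterion than averaging when chains are long.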
## Framework versions
- TRL: 0.19.1
- Transformers: 4.55.3
- PyTorch: 2.7.1
- Datasets: 3.0.1
- Tokenizers: 0.21.4
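One way to pin a matching environment (assuming a pip-based setup; the PyTorch wheel may additionally need the index URL for your CUDA version):

```bash
pip install trl==0.19.1 transformers==4.55.3 torch==2.7.1 datasets==3.0.1 tokenizers==0.21.4
```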
## Citations

@misc{ong2025vlprms,
      title={Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned},
      author={Brandon Ong and Tej Deep Pala and Vernon Toh and William Chandra Tjhi and Soujanya Poria},
      year={2025},
      eprint={2509.23250},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/pdf/2509.23250},
}