## Model Summary
Qwen-VL-PRM-3B is a process reward model finetuned from Qwen2.5-VL-3B-Instruct on approximately 300,000 examples. Despite being trained mainly on abstract and elementary reasoning datasets, it yields strong test-time scaling gains on a range of advanced multimodal reasoning benchmarks when paired with Qwen2.5-VL and Gemma-3 models.
## Use

The model's usage is documented here.
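As a quick illustration, the sketch below scores a single reasoning step with the checkpoint. This is a minimal sketch, not the documented interface: the repo id, image path, prompt wording, and the Yes/No verdict format are all assumptions, and it presumes the checkpoint loads with the standard Qwen2.5-VL classes in Transformers.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen-VL-PRM-3B"  # placeholder: substitute the actual Hub repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical judging prompt: show the image, the problem, and one
# candidate reasoning step, then ask for a binary verdict.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": (
            "Problem: <question text>\n"
            "Step: <candidate reasoning step>\n"
            "Is this step correct? Answer Yes or No."
        )},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open("puzzle.png")  # placeholder image path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Turn the logits of the assumed verdict tokens into a step score.
yes_id = processor.tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = processor.tokenizer.encode("No", add_special_tokens=False)[0]
p_correct = torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)[0].item()
print(f"P(step correct) ~ {p_correct:.3f}")
```

Reading the verdict token's probability from a single forward pass, rather than generating text, keeps step scoring cheap when many candidates must be ranked.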
## Evaluation
### Commercial Models

| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| GPT-4o | 70.7 | 60.0 | 57.8 | 30.9 | 31.2 | 50.1 |
| o1 | 78.2 | 78.9 | 54.4 | 73.9 | 60.3 | 69.1 |
| o3 | 82.9 | 84.1 | 62.3 | 86.8 | -- | -- |
### Qwen-2.5-VL Family

| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Qwen-2.5-VL-3B | 51.7 | 34.5 | 25.7 | 60.0 | 21.2 | 38.6 |
| + VL-PRM-7B | 53.7 (+2.0) | 44.9 (+10.5) | 28.3 (+2.6) | 64.1 (+4.1) | 21.8 (+0.6) | 42.6 (+4.0) |
| Qwen-2.5-VL-7B | 55.0 | 48.0 | 29.1 | 67.8 | 24.2 | 44.8 |
| + VL-PRM-3B | 57.6 (+2.6) | 55.5 (+7.5) | 33.8 (+4.7) | 70.0 (+2.2) | 26.1 (+1.9) | 48.6 (+3.6) |
| + VL-PRM-7B | 57.4 (+2.4) | 54.8 (+6.8) | 35.3 (+6.2) | 71.0 (+3.2) | 26.2 (+2.0) | 48.9 (+4.1) |
| Qwen-2.5-VL-32B | 66.0 | 46.2 | 26.9 | 76.9 | 36.7 | 50.5 |
| + VL-PRM-3B | 67.0 (+1.0) | 67.1 (+20.8) | 41.6 (+14.7) | 77.7 (+0.8) | 40.5 (+3.8) | 58.7 (+8.2) |
| + VL-PRM-7B | 67.6 (+1.6) | 66.8 (+20.6) | 44.2 (+17.3) | 78.3 (+1.4) | 40.1 (+3.2) | 59.4 (+8.9) |
### Gemma-3 Family

| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Gemma-3-12B | 57.6 | 45.0 | 29.1 | 58.9 | 28.1 | 43.7 |
| + VL-PRM-3B | 60.4 (+2.8) | 57.7 (+12.7) | 39.7 (+10.6) | 60.3 (+1.4) | 33.8 (+5.7) | 50.4 (+6.7) |
| + VL-PRM-7B | 60.2 (+2.6) | 59.0 (+12.0) | 41.1 (+4.5) | 63.3 (+4.4) | 33.9 (+5.8) | 51.5 (+7.8) |
| Gemma-3-27B | 62.9 | 50.8 | 29.9 | 61.6 | 32.4 | 47.5 |
| + VL-PRM-3B | 65.5 (+2.6) | 67.4 (+16.6) | 40.3 (+10.4) | 65.4 (+3.8) | 39.8 (+7.4) | 55.7 (+8.2) |
| + VL-PRM-7B | 64.5 (+1.6) | 67.6 (+16.8) | 41.1 (+11.2) | 65.2 (+3.6) | 40.9 (+8.5) | 55.9 (+8.4) |
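The "+ VL-PRM" rows above reflect test-time scaling: the policy model samples several candidate solutions, the PRM scores each one step by step, and the best-scoring candidate is kept. The sketch below shows that selection loop in schematic form; `generate_candidates` and `score_step` are hypothetical stand-ins (for your policy model and for a PRM scoring routine such as the one sketched under Use), and min-aggregation is just one common choice.

```python
from typing import Callable

# Schematic best-of-N selection with a PRM. `generate_candidates` and
# `score_step` are hypothetical stand-ins, not functions shipped with this repo.
def best_of_n(
    question: str,
    image_path: str,
    generate_candidates: Callable[[str, str, int], list[list[str]]],
    score_step: Callable[[str, str, str], float],
    n: int = 8,
) -> list[str]:
    # Each candidate is one solution: a list of reasoning steps
    # sampled from the policy model.
    candidates = generate_candidates(question, image_path, n)

    def solution_score(steps: list[str]) -> float:
        # Score a solution by its weakest step; taking the product or the
        # mean of step scores are common alternatives.
        return min(score_step(question, image_path, step) for step in steps)

    return max(candidates, key=solution_score)
```

Min-aggregation rejects any solution containing a step the PRM judges unsound, which tends to be a stricter criterion than averaging when chains are long.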
## Framework versions
- TRL: 0.19.1
- Transformers: 4.55.3
- PyTorch: 2.7.1
- Datasets: 3.0.1
- Tokenizers: 0.21.4
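One way to pin a matching environment (assuming a pip-based setup; the PyTorch wheel may additionally need the index URL for your CUDA version):

```bash
pip install trl==0.19.1 transformers==4.55.3 torch==2.7.1 datasets==3.0.1 tokenizers==0.21.4
```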
## Citations

@misc{ong2025vlprms,
      title={Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned},
      author={Brandon Ong and Tej Deep Pala and Vernon Toh and William Chandra Tjhi and Soujanya Poria},
      year={2025},
      eprint={2509.23250},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/pdf/2509.23250},
}