R1-VL-7B is a reasoning model trained with step-wise group relative policy optimization (StepGRPO).
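
To make the "group relative" part of the name concrete, here is a minimal sketch of the advantage normalization used in GRPO-style training: several responses are sampled for the same prompt, and each reward is standardized against its own group. This is an illustration only, not the R1-VL training code. The function name and the reward values are hypothetical, and the step-wise reward design that StepGRPO adds is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group.

    `rewards` holds one scalar reward per sampled response to the same
    prompt. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four reasoning paths sampled for one question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.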
Base model