R1-VL-2B is a multimodal reasoning model trained with Step-wise Group Relative Policy Optimization (StepGRPO).
Base model