
πŸš€ Can we cast reward modeling as a reasoning task?

RM-R1 is a training framework for Reasoning Reward Models (ReasRMs) that judge two candidate answers by first thinking out loud (generating rubrics or reasoning traces) and then emitting a preference.
Compared with prior scalar or vanilla generative reward models, RM-R1 delivers up to +13.8% absolute accuracy gains on public reward model benchmarks while providing fully interpretable critiques.
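
As a rough illustration of this judge-style interface, here is a minimal sketch that loads the checkpoint with πŸ€— Transformers and asks it to compare two candidate answers. This card does not reproduce the exact Chain-of-Rubrics prompt, so the system/user messages below are illustrative assumptions rather than the official template.

```python
# Minimal sketch: querying RM-R1 as a generative judge with Transformers.
# The prompt wording below is a placeholder assumption, not the official template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaotang/RM-R1-Qwen2.5-Instruct-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is the capital of Australia?"
answer_a = "The capital of Australia is Sydney."
answer_b = "The capital of Australia is Canberra."

messages = [
    {"role": "system", "content": "You are a reward model. First reason about "
     "which answer is better, then output your verdict as [[A]] or [[B]]."},
    {"role": "user", "content": f"Question: {question}\n\n"
     f"Answer A: {answer_a}\n\nAnswer B: {answer_b}"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model first emits its rubric/reasoning trace, then the preference.
output = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```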

TL;DR

  • Two-stage training

    1. Distillation of ~8.7K high-quality reasoning traces (Chain-of-Rubrics).
    2. Reinforcement Learning with Verifiable Rewards (RLVR) on ~64K preference pairs (see the reward sketch after this list).
  • Backbones released: 7B / 14B / 32B Qwen2.5-Instruct variants + DeepSeek-distilled checkpoints.
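
To make the "verifiable rewards" idea in stage 2 concrete, here is a minimal sketch of a rule-based reward over preference pairs: the judge's generated verdict is parsed and compared against the annotated winner. The `[[A]]`/`[[B]]` verdict markers and the binary 0/1 reward are assumptions for illustration, not necessarily the exact scheme used in training.

```python
# Sketch of a rule-based ("verifiable") reward over preference pairs.
# The [[A]]/[[B]] markers and 0/1 values are illustrative assumptions.
import re

def verifiable_reward(judge_output: str, gold_preference: str) -> float:
    """Return 1.0 if the judge's final verdict matches the annotated winner, else 0.0."""
    verdicts = re.findall(r"\[\[(A|B)\]\]", judge_output)
    if not verdicts:  # malformed output: no verdict found, no reward
        return 0.0
    return 1.0 if verdicts[-1] == gold_preference else 0.0

# Example: the generated critique ends with a verdict token.
print(verifiable_reward("Answer B is factually correct. Verdict: [[B]]", "B"))  # 1.0
```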

Intended uses

  • RLHF / RLAIF: plug-and-play reward function for policy optimisation (see the sketch after this list).
  • Automated evaluation: LLM-as-a-judge for open-domain QA, chat, and reasoning.
  • Research: study process supervision, chain-of-thought verification, or rubric generation.
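
For the RLHF / RLAIF use case, one common way to turn a pairwise judge into a scalar reward is to compare each policy sample against a fixed reference answer. The sketch below assumes a hypothetical `judge(...)` helper that wraps the generation call shown earlier and returns "A" or "B"; the +1/-1 mapping is an illustrative choice.

```python
# Sketch: using the pairwise judge as a plug-and-play RLHF reward by comparing
# each policy sample against a reference answer. `judge` is a hypothetical
# callable (e.g. wrapping the generation call above) returning "A" or "B".

def rlhf_reward(question: str, policy_answer: str, reference_answer: str, judge) -> float:
    """+1.0 if the judge prefers the policy's answer over the reference, else -1.0."""
    verdict = judge(question, answer_a=policy_answer, answer_b=reference_answer)
    return 1.0 if verdict == "A" else -1.0

# Toy usage with a stub judge that always prefers answer A:
print(rlhf_reward("Q?", "policy output", "reference output",
                  judge=lambda q, answer_a, answer_b: "A"))  # 1.0
```

In practice one would typically randomize or average over the A/B order to mitigate position bias in pairwise judging.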