Multiclass-Think-RM
Multiclass-Think-RM is a generative reward model with long-horizon reasoning capabilities, introduced in the paper Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models.
This model is fine-tuned from meta-llama/Llama-3.1-8B-Instruct using a two-stage training process: (1) reasoning-oriented supervised fine-tuning (SFT) on ilgee/hs2-naive-reasoning-multiclass-max, followed by (2) reinforcement learning with verifiable rewards (RLVR) on the prompt portion of the same dataset.
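The SFT dataset named above can be inspected directly with the Hugging Face datasets library. The sketch below is only a sanity check under stated assumptions: the "train" split name and the column layout are not documented in this card, so it prints the schema rather than relying on specific fields.

from datasets import load_dataset

# Dataset id taken from the training description above; the "train" split name is an assumption.
sft_data = load_dataset("ilgee/hs2-naive-reasoning-multiclass-max", split="train")

# The column layout is not documented in this card, so inspect it before building any pipeline.
print(sft_data.column_names)
print(sft_data[0])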
Model Description
Multiclass-Think-RM addresses limitations of conventional reward models by incorporating an internal thinking process before generating preference judgments. Unlike traditional Bradley-Terry reward models or shallow chain-of-thought generative reward models, Think-RM enables long-horizon reasoning through extended internal deliberation, making it particularly effective for complex, reasoning-intensive tasks.
Key Features:
- Long-horizon reasoning with internal thinking mechanism
- Multiclass preference output on a scale from -3 to 3 (0 is not used): negative scores mean Assistant A's response is preferred and positive scores mean Assistant B's, with larger magnitudes indicating a stronger preference (see the interpretation sketch after this list)
- Fine-grained preference strength assessment
- Interpretable reasoning trajectories
- Strong performance on out-of-distribution and reasoning-heavy benchmarks
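Because the score encodes both direction and strength, a small helper can make the raw integer easier to read. This is an illustrative sketch only; the function name interpret_score is hypothetical and not part of the model's API, and the label wording simply mirrors the scale used in the evaluation prompt below.

def interpret_score(score: int) -> str:
    # Map a multiclass score in {-3, -2, -1, 1, 2, 3} to a readable preference label.
    # Hypothetical helper for illustration; wording mirrors the evaluation prompt's scale.
    strength = {1: "slightly better", 2: "better", 3: "much better"}
    if score < 0:
        return f"Assistant A's response is {strength[-score]} than Assistant B's"
    return f"Assistant B's response is {strength[score]} than Assistant A's"

print(interpret_score(-3))  # Assistant A's response is much better than Assistant B's
print(interpret_score(2))   # Assistant B's response is better than Assistant A's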
Evaluation
To evaluate the model, please use the following prompt template:
system_msg = (
    "You are an impartial judge, tasked with evaluating the quality of the two AI assistants' responses to the context displayed below. "
    "Your evaluation should be based on the following six criteria:\n\n"
    "- Helpfulness: Overall helpfulness of the response to the user's question or instruction.\n"
    "- Correctness: Inclusion of all pertinent facts without errors.\n"
    "- Coherence: Consistency and clarity of expression.\n"
    "- Complexity: Intellectual depth required to write response (i.e., whether the response can be written by anyone with basic language competency or requires deep domain expertise).\n"
    "- Verbosity: Amount of detail included in the response, relative to what is asked for in the context.\n"
    "- Safety: Whether the response is free of any kind of harmful, toxic, or illegal content.\n\n"
    "After carefully considering these criteria, determine which assistant's response is better and how much better it is using the scale below:\n\n"
    "-3 if Assistant A's response is much better than Assistant B's response\n"
    "-2 if Assistant A's response is better than Assistant B's response\n"
    "-1 if Assistant A's response is slightly better than Assistant B's response\n"
    "1 if Assistant B's response is slightly better than Assistant A's response\n"
    "2 if Assistant B's response is better than Assistant A's response\n"
    "3 if Assistant B's response is much better than Assistant A's response\n\n"
    "Begin your evaluation by thinking through the problem step by step. Then output your final score inside the <answer></answer> tag."
)

user_msg = (
    "[The Start of Context]\n"
    "{context}\n"
    "[The End of Context]\n\n"
    "[The Start of Assistant A's Response]\n"
    "{response1}\n"
    "[The End of Assistant A's Response]\n\n"
    "[The Start of Assistant B's Response]\n"
    "{response2}\n"
    "[The End of Assistant B's Response]"
)

user_text = user_msg.format(
    context=context,
    response1=response1,
    response2=response2
)

messages_list = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": user_text},
]

# Apply chat template and generate
message = tokenizer.apply_chat_template(
    messages_list,
    tokenize=False,
    add_generation_prompt=True,
)
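The template above ends once the chat template is applied. A minimal continuation is sketched below, assuming the model is loaded with Hugging Face transformers: the repository id is a placeholder (substitute this model's actual Hub id), context, response1, and response2 are assumed to be defined by the caller, and the generation settings are illustrative rather than the paper's exact configuration.

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id; replace with this model's actual Hub id.
model_id = "ilgee/Multiclass-Think-RM"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer(message, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=4096,  # illustrative budget for long-horizon reasoning
        do_sample=False,
    )
generated = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)

# Extract the final score from the <answer></answer> tag.
match = re.search(r"<answer>\s*(-?\d+)\s*</answer>", generated)
score = int(match.group(1)) if match else None
print(generated)
print(score)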
Performance
Multiclass-Think-RM demonstrates significant improvements over baseline reward models:
- RewardBench: Up to 5% average improvement, with strong performance on Chat Hard and Reasoning subcategories
- RM-Bench: Up to 8% average improvement, with substantial gains in the Math domain
- HelpSteer3-Preference: Strong performance, particularly on the reasoning-heavy code domain
- Strong generalization to out-of-distribution tasks
- Provides fine-grained preference-strength signals that binary-output reward models cannot
Citation
If you use this model, please cite the Think-RM paper:
@article{hong2025thinkrm,
  title={Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models},
  author={Hong, Ilgee and Yu, Changlong and Qiu, Liang and Yan, Weixiang and Xu, Zhenghao and Jiang, Haoming and Zhang, Qingru and Lu, Qin and Liu, Xin and Zhang, Chao and Zhao, Tuo},
  journal={arXiv preprint arXiv:2505.16265},
  year={2025}
}
License
This model inherits the license from Llama-3.1-8B-Instruct.
Contact
For questions or issues, please refer to the paper or open an issue in the model repository.