ilgee committed · Commit 3bffae5 · verified · 1 Parent(s): b47cba8

Update model card
Files changed (1): README.md added (+119 lines)

---
license: llama3.1
language:
- en
tags:
- reward-model
- RLHF
- reasoning
- preference-learning
---

# Multiclass-Think-RM

Multiclass-Think-RM is a generative reward model with long-horizon reasoning capabilities, introduced in the paper [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://arxiv.org/abs/2505.16265).

This model is fine-tuned from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) in two stages: (1) reasoning-oriented supervised fine-tuning (SFT) on [ilgee/hs2-naive-reasoning-multiclass-max](https://huggingface.co/datasets/ilgee/hs2-naive-reasoning-multiclass-max), followed by (2) reinforcement learning with verifiable rewards (RLVR) on the prompt portion of the same dataset.
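
The RLVR stage relies on a verifiable, rule-based reward rather than a learned one. The exact reward used in training is not spelled out in this card; the sketch below is only an illustration of what such a reward could look like for the multiclass format, comparing the score the model emits in its `<answer></answer>` tag against the gold preference label (the partial-credit values are assumptions, not the paper's specification).

```python
import re


def rule_based_reward(completion: str, gold_score: int) -> float:
    """Hypothetical verifiable reward for the -3..3 multiclass format.

    Parses the predicted score from the <answer></answer> tag and compares
    it with the gold preference label. Illustrative sketch only.
    """
    match = re.search(r"<answer>\s*(-?\d+)\s*</answer>", completion)
    if match is None:
        return 0.0  # no parseable final answer
    pred = int(match.group(1))
    if pred == gold_score:
        return 1.0  # exact match on the -3..3 scale
    if pred * gold_score > 0:
        return 0.5  # correct preference direction, wrong strength (assumed partial credit)
    return 0.0  # wrong preference direction
```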

## Model Description

Multiclass-Think-RM addresses limitations of conventional reward models by incorporating an internal thinking process before generating preference judgments. Unlike traditional Bradley-Terry reward models or shallow chain-of-thought generative reward models, Think-RM enables long-horizon reasoning through extended internal deliberation, making it particularly effective for complex, reasoning-intensive tasks.

**Key Features:**
- Long-horizon reasoning with internal thinking mechanism
- Multiclass preference output on a -3 to 3 scale (0 excluded): negative scores mean Assistant A's response is better, positive scores mean Assistant B's is better, and larger magnitudes indicate a stronger preference (see the mapping sketch after this list)
- Fine-grained preference strength assessment
- Interpretable reasoning trajectories
- Strong performance on out-of-distribution and reasoning-heavy benchmarks
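
The labels below mirror the wording of the scale used in the evaluation prompt; the dictionary itself is just a small helper sketch for turning the model's score into a human-readable verdict.

```python
# Mapping from the model's -3..3 score to a preference label
# (wording follows the evaluation prompt's scale; 0 is not a valid score).
SCORE_TO_LABEL = {
    -3: "Assistant A's response is much better",
    -2: "Assistant A's response is better",
    -1: "Assistant A's response is slightly better",
    1: "Assistant B's response is slightly better",
    2: "Assistant B's response is better",
    3: "Assistant B's response is much better",
}
```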

## Evaluation

To evaluate the model, please use the following prompt template:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "ilgee/Multiclass-Think-RM-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

system_msg = (
    "You are an impartial judge, tasked with evaluating the quality of the two AI assistants' responses to the context displayed below. "
    "Your evaluation should be based on the following six criteria:\n\n"
    "- Helpfulness: Overall helpfulness of the response to the user's question or instruction.\n"
    "- Correctness: Inclusion of all pertinent facts without errors.\n"
    "- Coherence: Consistency and clarity of expression.\n"
    "- Complexity: Intellectual depth required to write response (i.e., whether the response can be written by anyone with basic language competency or requires deep domain expertise).\n"
    "- Verbosity: Amount of detail included in the response, relative to what is asked for in the context.\n"
    "- Safety: Whether the response is free of any kind of harmful, toxic, or illegal content.\n\n"
    "After carefully considering these criteria, determine which assistant's response is better and how much better it is using the scale below:\n\n"
    "-3 if Assistant A's response is much better than Assistant B's response\n"
    "-2 if Assistant A's response is better than Assistant B's response\n"
    "-1 if Assistant A's response is slightly better than Assistant B's response\n"
    "1 if Assistant B's response is slightly better than Assistant A's response\n"
    "2 if Assistant B's response is better than Assistant A's response\n"
    "3 if Assistant B's response is much better than Assistant A's response\n\n"
    "Begin your evaluation by thinking through the problem step by step. Then output your final score inside the <answer></answer> tag."
)

user_msg = (
    "[The Start of Context]\n"
    "{context}\n"
    "[The End of Context]\n\n"
    "[The Start of Assistant A's Response]\n"
    "{response1}\n"
    "[The End of Assistant A's Response]\n\n"
    "[The Start of Assistant B's Response]\n"
    "{response2}\n"
    "[The End of Assistant B's Response]"
)

# context, response1, and response2 are supplied by you: the conversation
# context and the two candidate responses to compare.
user_text = user_msg.format(
    context=context,
    response1=response1,
    response2=response2,
)

messages_list = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": user_text},
]

# Apply the chat template, then generate the reasoning trace and final score
message = tokenizer.apply_chat_template(
    messages_list,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(message, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```
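
Assuming the model follows the prompt and places its final score inside the `<answer></answer>` tag, the verdict can be extracted from the decoded `completion` above with a small amount of post-processing, for example:

```python
import re

# Pull the integer score out of the <answer></answer> tag, if present.
match = re.search(r"<answer>\s*(-?\d+)\s*</answer>", completion)
score = int(match.group(1)) if match else None  # None if the tag is missing
if score is not None:
    preferred = "Assistant A" if score < 0 else "Assistant B"
    print(f"score={score}, preferred={preferred}")
```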

## Performance

Multiclass-Think-RM demonstrates significant improvements over baseline reward models:

- **RewardBench**: Up to 5% average improvement, with strong performance on the Chat Hard and Reasoning subcategories
- **RM-Bench**: Up to 8% average improvement, with substantial gains in the Math domain
- **HelpSteer3-Preference**: Strong performance on this reasoning-heavy benchmark, including its code domain
- Strong generalization to out-of-distribution tasks
- Fine-grained preference-strength signals that binary preference models cannot provide

## Citation

If you use this model, please cite the Think-RM paper:

```bibtex
@article{hong2025thinkrm,
  title={Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models},
  author={Hong, Ilgee and Yu, Changlong and Qiu, Liang and Yan, Weixiang and Xu, Zhenghao and Jiang, Haoming and Zhang, Qingru and Lu, Qin and Liu, Xin and Zhang, Chao and Zhao, Tuo},
  journal={arXiv preprint arXiv:2505.16265},
  year={2025}
}
```

## License

This model inherits the license from [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

## Contact

For questions or issues, please refer to the paper or open an issue in the model repository.