ilgee committed · Commit a4cb3ef · verified · 1 Parent(s): 041d596

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +3 -9
README.md CHANGED
@@ -9,9 +9,9 @@ tags:
 - preference-learning
 ---

-# Multiclass-Think-RM
+# Multiclass-Think-RM-8B

-Multiclass-Think-RM is a generative reward model with long-horizon reasoning capabilities, introduced in the paper [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://arxiv.org/abs/2505.16265).
+Multiclass-Think-RM-8B is a generative reward model with long-horizon reasoning capabilities, introduced in the paper [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://arxiv.org/abs/2505.16265).

 This model is fine-tuned from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) using a two-stage training process: (1) reasoning-oriented supervised fine-tuning (SFT) using [ilgee/hs2-naive-reasoning-multiclass-max](https://huggingface.co/datasets/ilgee/hs2-naive-reasoning-multiclass-max) and (2) reinforcement learning with verifiable rewards (RLVR) using a prompt part of [ilgee/hs2-naive-reasoning-multiclass-max](https://huggingface.co/datasets/ilgee/hs2-naive-reasoning-multiclass-max).

@@ -83,13 +83,7 @@ message = tokenizer.apply_chat_template(

 ## Performance

-Multiclass-Think-RM demonstrates significant improvements over baseline reward models:
-
-- **RewardBench**: Up to 5% average improvement, with strong performance on Chat Hard and Reasoning subcategories
-- **RM-Bench**: Up to 8% average improvement, with substantial gains in the Math domain
-- **HelpSteer3-Preference**: Strong performance on this reasoning-heavy code domain
-- Strong generalization to out-of-distribution tasks
-- Provides fine-grained preference strength signals compared to binary models
+For detailed performance metrics on RewardBench, RM-Bench, HelpSteer2-Preference, and HelpSteer3-Preference, please refer to Tables 1, 2, and 3 in the paper: [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://arxiv.org/abs/2505.16265)

 ## Citation
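
The second hunk's context line (`message = tokenizer.apply_chat_template(`) indicates the README carries a transformers-based usage snippet that this diff does not show in full. The following is a minimal sketch of how such a generative reward model is typically queried, assuming a standard transformers workflow; the repo id `ilgee/Multiclass-Think-RM-8B` and the pairwise-judgment prompt below are placeholders, not taken from this commit, and the README's actual prompt template may differ.

```python
# Minimal sketch (not the README's exact usage code): load the model, wrap a
# pairwise comparison in the chat template, and generate the model's
# long-horizon reasoning plus verdict. Repo id and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ilgee/Multiclass-Think-RM-8B"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical pairwise comparison for the reward model to judge.
prompt = (
    "Compare the two responses to the prompt below and state which is better.\n\n"
    "Prompt: How do I reverse a list in Python?\n\n"
    "Response A: Use my_list.reverse() for in-place reversal, or reversed(my_list) for an iterator.\n\n"
    "Response B: Lists cannot be reversed in Python.\n"
)

message = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(message, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```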