Upload README.md with huggingface_hub

README.md CHANGED

@@ -9,9 +9,9 @@ tags:
 - preference-learning
 ---
 
-# Multiclass-Think-RM
+# Multiclass-Think-RM-8B
 
-Multiclass-Think-RM is a generative reward model with long-horizon reasoning capabilities, introduced in the paper [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://arxiv.org/abs/2505.16265).
+Multiclass-Think-RM-8B is a generative reward model with long-horizon reasoning capabilities, introduced in the paper [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://arxiv.org/abs/2505.16265).
 
 This model is fine-tuned from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) using a two-stage training process: (1) reasoning-oriented supervised fine-tuning (SFT) on [ilgee/hs2-naive-reasoning-multiclass-max](https://huggingface.co/datasets/ilgee/hs2-naive-reasoning-multiclass-max) and (2) reinforcement learning with verifiable rewards (RLVR) on the prompt portion of the same dataset.
 
@@ -83,13 +83,7 @@ message = tokenizer.apply_chat_template(
 
 ## Performance
 
-
-
-- **RewardBench**: Up to 5% average improvement, with strong performance on the Chat Hard and Reasoning subcategories
-- **RM-Bench**: Up to 8% average improvement, with substantial gains in the Math domain
-- **HelpSteer3-Preference**: Strong performance on this reasoning-heavy code domain
-- Strong generalization to out-of-distribution tasks
-- Provides fine-grained preference strength signals compared to binary models
+For detailed performance metrics on RewardBench, RM-Bench, HelpSteer2-Preference, and HelpSteer3-Preference, please refer to Tables 1, 2, and 3 in the paper: [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://arxiv.org/abs/2505.16265).
 
 ## Citation
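
The second hunk's context line (`message = tokenizer.apply_chat_template(`) comes from the model card's usage section, which is not included in this diff. As a minimal sketch only: the judge prompt format, helper name, and repo id below are illustrative assumptions, not the card's official template.

```python
# Hypothetical sketch of how a pairwise generative reward model such as
# Multiclass-Think-RM-8B is typically queried. The bracketed prompt format
# and the helper below are assumptions for illustration.

def build_judge_messages(question: str, response_a: str, response_b: str) -> list:
    """Pack a pairwise comparison into a single-turn chat message list."""
    user_content = (
        f"[User Question]\n{question}\n\n"
        f"[Response A]\n{response_a}\n\n"
        f"[Response B]\n{response_b}\n\n"
        "Think step by step, then state which response is better and by how much."
    )
    return [{"role": "user", "content": user_content}]


messages = build_judge_messages(
    "What is 2 + 2?",
    "2 + 2 = 4.",
    "The answer is 5.",
)

# Scoring with the actual model requires downloading the 8B weights, e.g.:
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("<repo-id>")  # repo id not shown in this diff
#   prompt = tokenizer.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True
#   )
print(messages[0]["role"])  # -> user
```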