Update README.md
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

<center>
<h2><a href="https://arxiv.org/abs/2505.14674">Paper Link👀</a></h2>
</center>
## 1. Introduction

We propose **Reward Reasoning Models (RRMs)**. Unlike existing reward models, RRMs frame reward modeling as a reasoning task: the model first produces a long chain-of-thought reasoning process before generating the final rewards.
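The pattern described above can be sketched as a pairwise-comparison prompt. This is a minimal illustration, not the paper's actual template: the prompt wording and the `build_pairwise_prompt` helper are our assumptions; only the reason-then-`\boxed{...}` output convention comes from the example below.

```python
# Hypothetical prompt builder for an RRM-style pairwise judge (the template
# wording is an assumption; the \boxed{...} verdict format matches the
# example generation shown later in this README).
def build_pairwise_prompt(instruction: str, response_1: str, response_2: str) -> str:
    return (
        "You are a reward model. Compare the two assistant responses to the\n"
        "instruction below. Think step by step, then give your final verdict\n"
        "as \\boxed{Assistant 1} or \\boxed{Assistant 2}.\n\n"
        f"Instruction: {instruction}\n\n"
        f"Assistant 1: {response_1}\n\n"
        f"Assistant 2: {response_2}\n"
    )

prompt = build_pairwise_prompt(
    "How do I detail a car?",
    "Wash the exterior, then clean the interior.",
    "Step-by-step: wash, dry, clay bar, polish, wax; then vacuum and wipe the interior.",
)
print(prompt)
```

The model's long chain of thought is elicited by the "think step by step" instruction; the reward itself is whatever lands inside the final `\boxed{...}`.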
The generated text:

```
Alright, so I need to figure out which assistant, Assistant 1 or Assistant 2, is better for the given instruction. The instruction is "How do I detail a car?" I'll start by looking at the instructions and the rules provided.

First, the instruction doesn't contain any harmful content, so my main focus should be on how well the output honestly and precisely executes the instruction. I should also consider the helpfulness, accuracy, and detail of the response.

Looking at Assistant 1's response, it starts by mentioning exterior detailing, including washing and interior detailing with specific tasks. However, it seems a bit brief and might be missing some steps. For example, it doesn't mention things like using the right equipment or the importance of drying the car properly. It also doesn't go into as much detail about the products used, which could be helpful.

On the other hand, Assistant 2's response is much more comprehensive. It provides a detailed step-by-step guide, starting with exterior detailing and then moving on to interior detailing. Each section has specific steps, like washing, drying, using a clay bar, polishing, waxing, and so on. It also includes tips about where to work, using separate buckets, and the importance of gentle materials. This makes the response thorough and informative, covering all aspects of car detailing.

I notice that Assistant 2's response is longer but more detailed, which is appropriate because the instruction is about detailing a car, which can have multiple steps. Assistant 1's response, while it touches on some points, is more concise and might not cover all necessary steps, making it less helpful for someone looking to detail their car properly.

Additionally, Assistant 2 includes additional tips, which adds value to the response. It also mentions things like working in a shaded area and using separate buckets, which are practical tips that can improve the detailing process.

In terms of helpfulness, Assistant 2's response is more helpful because it provides a clear, step-by-step guide that someone can follow without getting overwhelmed. Assistant 1's response, while correct, is too brief and might not guide the user effectively.

I also need to consider the length of the responses. Assistant 2's response is longer, but it's appropriate because the instruction is detailed enough to require a thorough explanation. Assistant 1's response is too short, which might not be sufficient for someone looking to detail their car properly.

In conclusion, considering the need for a thorough and helpful response, Assistant 2 is better because it provides a comprehensive guide with all the necessary steps and tips for detailing a car. Assistant 1, while correct, is too brief and lacks the depth needed for a complete answer.
</think>

\boxed{Assistant 2}
```
## 6. Citation

```
@misc{rewardreasoningmodel,
      title={Reward Reasoning Model},
      author={Jiaxin Guo and Zewen Chi and Li Dong and Qingxiu Dong and Xun Wu and Shaohan Huang and Furu Wei},
      year={2025},
      eprint={2505.14674},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14674},
}
```