## 🛠️ Training & Evaluation

### 🔧 Training Pipeline
Our training consists of two key stages:

1. **Supervised Fine-Tuning (SFT)**: We prompt stronger LLMs with PRM800K samples to construct seed data containing detailed step-level analyses and correctness judgments. The model is then trained to generate both reasoning critiques and binary decisions.
2. **Direct Preference Optimization (DPO)**: We sample multiple evaluation trajectories and construct preference pairs based on their agreement with ground-truth labels. DPO encourages the model to generate consistent and accurate evaluations without requiring additional annotations (the pairing step is sketched below).
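
To make the pairing step concrete, here is a minimal sketch of how preference pairs could be assembled from sampled trajectories. It is an illustration only, not the repository's actual pipeline: `sample_trajectories` stands in for the sampling logic in `src/utils/inference.py`, and the `critique`/`judgment` field names are assumptions.

```python
def build_preference_pairs(prompt, gold_label, sample_trajectories, k=10):
    """Pair trajectories that agree with the ground-truth label (chosen)
    against those that disagree (rejected).

    `sample_trajectories` is a hypothetical stand-in for the sampling
    logic in src/utils/inference.py; each trajectory is assumed to be a
    dict with a free-form "critique" and a binary "judgment".
    """
    trajectories = sample_trajectories(prompt, k=k)
    chosen = [t for t in trajectories if t["judgment"] == gold_label]
    rejected = [t for t in trajectories if t["judgment"] != gold_label]

    # One DPO pair per (agreeing, disagreeing) combination; in practice
    # the pairs would likely be subsampled to control dataset size.
    return [
        {"prompt": prompt, "chosen": c["critique"], "rejected": r["critique"]}
        for c in chosen
        for r in rejected
    ]
```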

### 📦 Dataset & Scale

- **SFT Training Data**: ~289K samples generated by prompting LLaMA3.3-70B-Instruct
- **DPO Preference Pairs**: ~269K pairs constructed from sampled trajectories
- **Validation Set**: 20K held-out samples used for early stopping
### ⚙️ Model & Hyperparameters

- **Base Model**: [Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Batch Size**: 128
- **Epochs**: 1
- **Learning Rates**:
  - SFT: `5e-6`
  - DPO: `5e-7`
- **Inference Trajectories**: `K=10` per step by default (aggregation sketched below)
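
At inference time, the step score comes from aggregating the `K` sampled evaluation trajectories. The exact aggregation rule lives in the repository's inference code; the snippet below is only one plausible reading (averaging the binary judgments into a correctness score), with `judge_step` a hypothetical single-trajectory call.

```python
def score_step(question, steps_so_far, judge_step, k=10):
    """Aggregate K sampled evaluation trajectories into one step score.

    `judge_step` is a hypothetical callable that runs the reward model
    once (with sampling) and returns True/False for the latest step;
    the mean of the K judgments serves as the step score here.
    """
    judgments = [judge_step(question, steps_so_far) for _ in range(k)]
    return sum(judgments) / k
```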

### 🧪 Evaluation Protocol

We evaluate the reward model on the following benchmarks:

- **📊 ProcessBench**
  Step-level reasoning evaluation, reported as F1 score (the metric is sketched after this list).
  📎 *Script*: `src/scripts/examples/eval-ProcessBench.sh`
- **🧠 PRMBench**
  Multi-dimensional evaluation across Simplicity, Soundness, and Sensitivity.
  📎 *Script*: `src/scripts/examples/eval-PRMBench.sh`
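
For context on the ProcessBench number: the benchmark asks the model to locate the first erroneous step in a solution (or report that none exists), and its F1 is, to our understanding, the harmonic mean of the accuracies on the erroneous and error-free subsets. A minimal sketch of that metric, assuming predictions and labels are first-error step indices with `-1` meaning "no error":

```python
def processbench_f1(preds, labels):
    """Harmonic mean of accuracies on erroneous and error-free problems.

    preds / labels: first-error step indices, with -1 meaning "no error".
    """
    err = [(p, l) for p, l in zip(preds, labels) if l != -1]
    ok = [(p, l) for p, l in zip(preds, labels) if l == -1]
    acc_err = sum(p == l for p, l in err) / len(err)
    acc_ok = sum(p == l for p, l in ok) / len(ok)
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)
```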

🔧 You can also use `src/utils/inference.py` to construct the training data (SFT samples and preference pairs).

## Citation

If you find this repository helpful, feel free to cite our paper: