kevinpro committed · verified
Commit b7ce17c · 1 Parent(s): 4863973

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +0 -38

README.md CHANGED
@@ -95,44 +95,6 @@ R-PRM demonstrates exceptional data efficiency under varying training scales:
 
  ![Figure3: ProcessBench Scaling](./fig/processbench-scaling.png)
 
- ## 🛠️ Training & Evaluation
-
- ### 🔧 Training Pipeline
-
- Our training consists of two key stages:
-
- 1. **Supervised Fine-Tuning (SFT)**: We prompt stronger LLMs with PRM800K samples to construct seed data containing detailed step-level analyses and correctness judgments. The model is then trained to generate both reasoning critiques and binary decisions.
- 2. **Direct Preference Optimization (DPO)**: We sample multiple evaluation trajectories and construct preference pairs based on their agreement with ground-truth labels (sketched below). DPO encourages the model to generate consistent and accurate evaluations without requiring additional annotations.
-
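A minimal sketch of this pairing rule, assuming each sampled trajectory ends in a binary judgment; the `Trajectory` type, the `build_preference_pairs` helper, and the all-pairs strategy are illustrative rather than the repository's actual implementation:

```python
# Illustrative sketch of DPO pair construction: trajectories whose final
# judgment agrees with the ground-truth label become "chosen", the
# disagreeing ones become "rejected". Types and the pairing rule are
# assumptions, not the repo's actual code.
from dataclasses import dataclass
from itertools import product

@dataclass
class Trajectory:
    critique: str   # step-level analysis text
    judgment: bool  # final binary correctness decision

def build_preference_pairs(trajectories: list[Trajectory], gold_label: bool):
    chosen = [t for t in trajectories if t.judgment == gold_label]
    rejected = [t for t in trajectories if t.judgment != gold_label]
    # One pair per (agreeing, disagreeing) combination; a real pipeline
    # would likely cap or subsample this product.
    return [(c.critique, r.critique) for c, r in product(chosen, rejected)]
```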
- ### 📦 Dataset & Scale
-
- - **SFT Training Data**: ~289K samples generated by prompting LLaMA3.3-70B-Instruct
- - **DPO Preference Pairs**: ~269K pairs constructed from sampled trajectories
- - **Validation Set**: 20K held-out samples for early stopping
-
- ### ⚙️ Model & Hyperparameters
-
- - **Base Model**: [Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- - **Batch Size**: 128
- - **Epochs**: 1
- - **Learning Rates**:
-   - SFT: `5e-6`
-   - DPO: `5e-7`
- - **Inference Trajectories**: `K=10` sampled per step by default (aggregation sketched after this list)
-
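At inference time the `K` sampled trajectories per step must be reduced to a single verdict; this README does not state the aggregation rule, so the majority vote below is only an assumed placeholder:

```python
# Assumed aggregation of K sampled judgments into one step-level verdict.
# The actual reduction used by R-PRM (e.g. mean correctness probability
# vs. majority vote) is not stated here; majority vote is a placeholder.
def aggregate_step_judgments(judgments: list[bool], k: int = 10) -> bool:
    assert len(judgments) == k, "expected one judgment per trajectory"
    return sum(judgments) > k / 2  # step deemed correct if most agree
```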
- ### 🧪 Evaluation Protocol
-
- We evaluate the reward model on the following benchmarks:
-
- - **🔍 ProcessBench**
-   Step-level reasoning evaluation, reported as F1 score.
-   📄 *Script*: `src/scripts/examples/eval-ProcessBench.sh`
- - **🧠 PRMBench**
-   Multi-dimensional evaluation across Simplicity, Soundness, and Sensitivity.
-   📄 *Script*: `src/scripts/examples/eval-PRMBench.sh`
-
- 🔧 You can also use `src/utils/inference.py` to construct the training data (SFT samples and preference pairs).
-
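As context for the ProcessBench metric: the benchmark reports the harmonic mean of two subset accuracies, on solutions containing an error and on fully correct solutions. A minimal sketch, with the two accuracies as assumed inputs:

```python
# ProcessBench-style F1: harmonic mean of accuracy at locating the
# earliest error step (erroneous subset) and accuracy at answering
# "no error" (correct subset). The input accuracies are placeholders.
def processbench_f1(erroneous_acc: float, correct_acc: float) -> float:
    if erroneous_acc + correct_acc == 0:
        return 0.0
    return 2 * erroneous_acc * correct_acc / (erroneous_acc + correct_acc)
```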
 ## Citation
 
 If you find this repository helpful, feel free to cite our paper: