## 🛠️ Training & Evaluation

### 🔧 Training Pipeline
Our training consists of two key stages:

1. **Supervised Fine-Tuning (SFT)**: We prompt stronger LLMs with PRM800K samples to construct seed data containing detailed step-level analyses and correctness judgments. The model is then trained to generate both reasoning critiques and binary decisions.
2. **Direct Preference Optimization (DPO)**: We sample multiple evaluation trajectories and construct preference pairs based on their agreement with ground-truth labels. DPO encourages the model to generate consistent and accurate evaluations without requiring additional annotations (the pairing step is sketched below).
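
To make the pairing step concrete, here is a minimal sketch of how preference pairs could be assembled from sampled trajectories. It is an illustration only, not the repository's actual pipeline: `sample_trajectories` stands in for the sampling logic in `src/utils/inference.py`, and the `critique`/`judgment` field names are assumptions.

```python
def build_preference_pairs(prompt, gold_label, sample_trajectories, k=10):
    """Pair trajectories that agree with the ground-truth label (chosen)
    against those that disagree (rejected).

    `sample_trajectories` is a hypothetical stand-in for the sampling
    logic in src/utils/inference.py; each trajectory is assumed to be a
    dict with a free-form "critique" and a binary "judgment".
    """
    trajectories = sample_trajectories(prompt, k=k)
    chosen = [t for t in trajectories if t["judgment"] == gold_label]
    rejected = [t for t in trajectories if t["judgment"] != gold_label]

    # One DPO pair per (agreeing, disagreeing) combination; in practice
    # the pairs would likely be subsampled to control dataset size.
    return [
        {"prompt": prompt, "chosen": c["critique"], "rejected": r["critique"]}
        for c in chosen
        for r in rejected
    ]
```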

### 📦 Dataset & Scale

- **SFT Training Data**: ~289K samples generated by prompting LLaMA3.3-70B-Instruct
- **DPO Preference Pairs**: ~269K pairs constructed from sampled trajectories
- **Validation Set**: 20K held-out samples used for early stopping
### ⚙️ Model & Hyperparameters

- **Base Model**: [Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Batch Size**: 128
- **Epochs**: 1
- **Learning Rates**:
  - SFT: `5e-6`
  - DPO: `5e-7`
- **Inference Trajectories**: `K=10` per step by default (aggregation sketched below)
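
At inference time, the step score comes from aggregating the `K` sampled evaluation trajectories. The exact aggregation rule lives in the repository's inference code; the snippet below is only one plausible reading (averaging the binary judgments into a correctness score), with `judge_step` a hypothetical single-trajectory call.

```python
def score_step(question, steps_so_far, judge_step, k=10):
    """Aggregate K sampled evaluation trajectories into one step score.

    `judge_step` is a hypothetical callable that runs the reward model
    once (with sampling) and returns True/False for the latest step;
    the mean of the K judgments serves as the step score here.
    """
    judgments = [judge_step(question, steps_so_far) for _ in range(k)]
    return sum(judgments) / k
```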

### 🧪 Evaluation Protocol

We evaluate the reward model on the following benchmarks:

- **📊 ProcessBench**
  Step-level reasoning evaluation, reported as F1 score (the metric is sketched after this list).
  📎 *Script*: `src/scripts/examples/eval-ProcessBench.sh`
- **🧠 PRMBench**
  Multi-dimensional evaluation across Simplicity, Soundness, and Sensitivity.
  📎 *Script*: `src/scripts/examples/eval-PRMBench.sh`
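
For context on the ProcessBench number: the benchmark asks the model to locate the first erroneous step in a solution (or report that none exists), and its F1 is, to our understanding, the harmonic mean of the accuracies on the erroneous and error-free subsets. A minimal sketch of that metric, assuming predictions and labels are first-error step indices with `-1` meaning "no error":

```python
def processbench_f1(preds, labels):
    """Harmonic mean of accuracies on erroneous and error-free problems.

    preds / labels: first-error step indices, with -1 meaning "no error".
    """
    err = [(p, l) for p, l in zip(preds, labels) if l != -1]
    ok = [(p, l) for p, l in zip(preds, labels) if l == -1]
    acc_err = sum(p == l for p, l in err) / len(err)
    acc_ok = sum(p == l for p, l in ok) / len(ok)
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)
```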

🔧 You can also use `src/utils/inference.py` to construct the training data (SFT samples and preference pairs).

## Citation

If you find this repository helpful, feel free to cite our paper: