Abstract
As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with the error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct an extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, demonstrates critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
Community
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.
Data: https://huggingface.co/datasets/Qwen/ProcessBench
Evaluation code: https://github.com/QwenLM/ProcessBench
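As a rough illustration of the task (identify the earliest erroneous step, or conclude that all steps are correct), a prediction can be scored by exact match against the annotated error location. The sketch below is not the official evaluation code from the repository above; the `-1` sentinel for "no error" and the function names are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed sentinel meaning "all steps are correct" (hypothetical convention).
NO_ERROR = -1


def accuracy(predictions: List[Optional[int]], labels: List[int]) -> float:
    """Fraction of cases where the predicted step index equals the annotated one.

    A prediction of None (e.g. the model output could not be parsed)
    never matches a label, so it counts as a miss.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    hits = sum(1 for p, y in zip(predictions, labels) if p == y)
    return hits / len(labels)


def split_accuracy(
    predictions: List[Optional[int]], labels: List[int]
) -> Tuple[float, float]:
    """Accuracy on erroneous solutions and on fully correct solutions, separately."""
    pairs = list(zip(predictions, labels))
    erroneous = [(p, y) for p, y in pairs if y != NO_ERROR]
    correct = [(p, y) for p, y in pairs if y == NO_ERROR]
    acc_erroneous = accuracy([p for p, _ in erroneous], [y for _, y in erroneous])
    acc_correct = accuracy([p for p, _ in correct], [y for _, y in correct])
    return acc_erroneous, acc_correct


# Toy data: annotated earliest-error indices and hypothetical model predictions.
labels = [2, NO_ERROR, 1, NO_ERROR, 0]
predictions = [2, NO_ERROR, 0, None, 0]
overall = accuracy(predictions, labels)
acc_erroneous, acc_correct = split_accuracy(predictions, labels)
```

Reporting the two subsets separately keeps a model that always answers "no error" (or always flags step 0) from looking good on the overall number.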
Here are some intriguing conclusions from a few experiments:
Currently, PRMs whose training data is constructed with MCTS may not perform as well as a PRM trained directly on the PRM800K dataset.
The more challenging the dataset, the higher the proportion of cases where the answer is correct but the process leading to it is flawed. In datasets of Omni-MATH-level difficulty, this phenomenon occurs in over 50% of instances. Therefore, relying solely on answer matching as the reward rule might lead to scaling issues in the future.
Surprisingly, the reasoning model QwQ-32B-Preview, which was not designed for the critic role and has not been trained on related data, performs exceptionally well as a critic, surpassing all known PRMs to date.
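The second conclusion above can be made concrete with a small calculation. The sketch below is only one reading of that statistic: it assumes the denominator is the set of solutions whose final answer is correct, and the `Sample` type, its field names, and the `-1` "no error" convention are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    final_answer_correct: bool  # does the final answer match the reference?
    first_error_step: int       # index of the earliest erroneous step, -1 if none (assumed)


def flawed_process_rate(samples: List[Sample]) -> float:
    """Among solutions with a correct final answer, the fraction whose
    reasoning still contains an erroneous step."""
    answer_ok = [s for s in samples if s.final_answer_correct]
    if not answer_ok:
        return 0.0
    flawed = sum(1 for s in answer_ok if s.first_error_step != -1)
    return flawed / len(answer_ok)


# Toy data, not taken from the benchmark.
samples = [
    Sample(final_answer_correct=True, first_error_step=-1),
    Sample(final_answer_correct=True, first_error_step=3),
    Sample(final_answer_correct=False, first_error_step=1),
    Sample(final_answer_correct=True, first_error_step=0),
]
rate = flawed_process_rate(samples)
```

An answer-matching reward would give full credit to every solution in the `answer_ok` group, including the flawed ones, which is the concern raised above.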
Outstanding work and an immensely valuable artifact to publish! It's amazing to see this much effort towards creating meta-evaluations and studying process supervision ❤️
The findings and discussions on the shortcomings of PRMs are superb! The "policy-specialization" and subpar performance induced by the standard PRM training process are intuitively evident (once brought up, that is 🙂) but to my knowledge weren't properly demonstrated and quantified up until now.
My team and I recently published two benchmarks on university-specific math, one of which, µ-MATH, is also a meta-benchmark. I was working on that one specifically, so meta-evals really hit home :)
https://huggingface.co/datasets/toloka/mu-math
We decided to make our own benchmarks because the popular available ones at the time were either at / below high-school level, or mainly leaning towards Olympiad-style problems, or synthetically generated from a set of templates / seeds. We wanted explicit focus on university curricula problems and we wanted "organic" variety, so we created a bench of our own using problems sourced from teaching materials currently used in US universities.
For meta-evals, our focus was more on evaluating the LLM-as-a-judge approach: studying qualitative behavior differences of various judges and their biases, comparing closed- vs open-source judges.
One of the more curious findings of our work, for me personally, is the distinctly different behavior patterns exhibited by Qwen judges vs all the others. Qwens are brilliant at the involved derivation chains that are necessary to properly check a solution against a reference answer, while leading closed-source models are more conservatively "anchored" on the exact form of the reference, so they admit far more false negatives. It seems that the Qwen team is very adept at crafting data that unlocks quality long-form reasoning chains, and that is also critical for math judgments. No wonder QwQ makes for such a good critic model on ProcessBench :)
Enriching our benchmarks with ProcessBench-style step-by-step feedback and studying process supervision quality was actually our next objective! Thanks again for sharing the work, the insights, and the dataset! Really happy to see open-source advancing in these topics.