Training Step-Level Reasoning Verifiers with Formal Verification Tools
Abstract
FoVer is a method for automatically annotating step-level error labels with formal verification tools to train Process Reward Models; the resulting PRMs generalize across tasks and match or outperform PRMs trained on human-annotated labels on diverse reasoning benchmarks.
Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proving, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proving tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.
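The core labeling idea, checking each reasoning step with a formal tool and recording a binary correctness label, can be sketched with a toy propositional validity checker. This is a minimal stand-in for a solver like Z3, not the paper's implementation; the derivation steps and variable names below are hypothetical. A step is labeled 1 if its conclusion is entailed by its premises (no counterexample assignment exists), and 0 otherwise.

```python
from itertools import product

def step_is_valid(premises, conclusion, variables):
    """Brute-force propositional entailment check:
    the step is valid iff no truth assignment makes all
    premises true while the conclusion is false."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample found: step is erroneous
    return True

# Hypothetical two-step derivation over variables p, q, r:
#   step 1: from (p -> q) and p, conclude q   (modus ponens, valid)
#   step 2: from (q -> r) and r, conclude q   (affirming the consequent, invalid)
steps = [
    ([lambda e: (not e["p"]) or e["q"], lambda e: e["p"]], lambda e: e["q"]),
    ([lambda e: (not e["q"]) or e["r"], lambda e: e["r"]], lambda e: e["q"]),
]
labels = [int(step_is_valid(prems, concl, ["p", "q", "r"]))
          for prems, concl in steps]
print(labels)  # step-level error labels: [1, 0]
```

An SMT solver plays the same role at scale by checking whether premises plus the negated conclusion are unsatisfiable, which is what makes the annotation fully automatic.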
Community
- Project website: https://fover-prm.github.io/
- Code: https://github.com/psunlpgroup/FoVer
- Models & Datasets: https://huggingface.co/collections/ryokamoi/fover-682e28cc9f6200c7dfd5342f
Related papers recommended by the Semantic Scholar API:
- Neural Theorem Proving: Generating and Structuring Proofs for Formal Verification (2025)
- HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement (2025)
- VerifiAgent: a Unified Verification Agent in Language Model Reasoning (2025)
- T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models (2025)
- Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework (2025)
- Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach (2025)
- APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning (2025)
- Models citing this paper: 2
- Datasets citing this paper: 7
- Spaces citing this paper: 0