Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment
Abstract
PrefCleanBench evaluates 13 preference data cleaning methods for aligning large language models with human preferences, providing a standardized protocol to assess their effectiveness and generalizability.
Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce PrefCleanBench, the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality, highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.
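To make the "clean first, align later" protocol concrete, here is a minimal Python sketch of how such a benchmark pipeline could be structured. All names (`PreferencePair`, `confidence_filter`, `run_protocol`) are illustrative assumptions, not the repository's actual API; the cleaning rule shown (confidence-based filtering) is just one plausible stand-in for the 13 methods the paper compares.

```python
# Hypothetical sketch of a clean-then-align benchmark protocol.
# Names and interfaces are assumptions, NOT the PrefCleanBench API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str        # response labeled as preferred
    rejected: str      # response labeled as dispreferred
    confidence: float  # proxy for label quality, e.g. annotator agreement

def confidence_filter(data: List[PreferencePair],
                      threshold: float = 0.6) -> List[PreferencePair]:
    """One simple cleaning strategy: drop pairs whose label confidence
    falls below a threshold (i.e., likely noisy or inconsistent labels)."""
    return [ex for ex in data if ex.confidence >= threshold]

def run_protocol(raw_data: List[PreferencePair],
                 clean_fn: Callable,
                 align_fn: Callable,
                 eval_fn: Callable):
    """Standardized protocol: clean first, align later, then evaluate.
    Because clean_fn, align_fn, and eval_fn are swappable, different
    cleaning methods can be compared under identical training and
    evaluation conditions."""
    cleaned = clean_fn(raw_data)
    policy = align_fn(cleaned)   # e.g., DPO or PPO fine-tuning
    return eval_fn(policy)       # e.g., an alignment benchmark score

if __name__ == "__main__":
    data = [
        PreferencePair("Q1", "good answer", "bad answer", confidence=0.9),
        PreferencePair("Q2", "answer A", "answer B", confidence=0.3),  # noisy
    ]
    score = run_protocol(
        data,
        clean_fn=confidence_filter,
        align_fn=lambda d: f"policy trained on {len(d)} pairs",  # stub
        eval_fn=lambda p: p,                                     # stub
    )
    print(score)  # -> "policy trained on 1 pairs"
```

Holding the alignment and evaluation stages fixed while varying only `clean_fn` is what allows the effect of each cleaning method to be isolated and compared fairly.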
Community
The following papers, recommended by the Semantic Scholar API via Librarian Bot, are similar to this paper:
- A Comprehensive Evaluation framework of Alignment Techniques for LLMs (2025)
- Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety (2025)
- Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning (2025)
- PROPS: Progressively Private Self-alignment of Large Language Models (2025)
- Benchmarking and Improving LLM Robustness for Personalized Generation (2025)
- Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization (2025)
- Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M (2025)