Robust Preference Optimization via Dynamic Target Margins
Abstract
The paper introduces γ-PO, a dynamic target margin preference optimization algorithm that enhances Large Language Models' alignment by adjusting reward margins at the pairwise level, leading to improved performance with minimal impact on training.
The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO depends heavily on data quality, which is frequently compromised by noise. In this work, we propose gamma-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, gamma-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, gamma-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, gamma-PO achieves an average 4.4% improvement over other baselines, setting a new state of the art. Additionally, gamma-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at https://github.com/sunjie279/gammaPO.
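At a high level, gamma-PO replaces the fixed target margin used by margin-based DPO variants (e.g., the γ in SimPO) with an instance-specific one, so that high-confidence pairs with larger implicit reward margins are trained against larger targets while ambiguous pairs are penalized less. The sketch below only illustrates this idea and is not the paper's implementation: the function name `dynamic_margin_dpo_loss`, the softmax-over-batch calibration rule, and the hyperparameters `gamma0` and `tau` are assumptions made for the example.

```python
# Minimal sketch of a DPO-style loss with a dynamic, per-pair target margin.
# The exact margin-calibration rule of gamma-PO is not given in the abstract;
# here gamma_i is illustratively scaled by a softmax over the batch's reward
# margins, so pairs with larger margins receive larger target margins.
import torch
import torch.nn.functional as F

def dynamic_margin_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                            ref_chosen_logps, ref_rejected_logps,
                            beta=0.1, gamma0=0.5, tau=1.0):
    """All *_logps are 1-D tensors of summed token log-probabilities per sequence.

    gamma0 (base target margin) and tau (calibration temperature) are
    illustrative hyperparameters, not values from the paper.
    """
    # Implicit reward margin per preference pair, as in standard DPO.
    margins = beta * ((policy_chosen_logps - ref_chosen_logps)
                      - (policy_rejected_logps - ref_rejected_logps))

    # Instance-specific target margin: larger for high-confidence pairs,
    # smaller for ambiguous (potentially noisy) ones. detach() keeps the
    # calibration step out of the gradient path.
    weights = F.softmax(margins.detach() / tau, dim=0) * margins.size(0)
    gammas = gamma0 * weights

    # Standard logistic preference loss, now against the dynamic target margin.
    return -F.logsigmoid(margins - gammas).mean()

# Example usage with random log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    b = 4
    loss = dynamic_margin_dpo_loss(torch.randn(b), torch.randn(b),
                                   torch.randn(b), torch.randn(b))
    print(loss.item())
```

Because the dynamic margin only modifies the scalar subtracted inside the logistic loss, the same pattern drops into any margin-based DPO variant with a few lines of change, which is consistent with the plug-and-play claim above.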
Community
18 pages, 6 figures, accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- InfoPO: On Mutual Information Maximization for Large Language Model Alignment (2025)
- Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model (2025)
- SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment (2025)
- Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment (2025)
- Energy-Based Reward Models for Robust Language Model Alignment (2025)
- Direct Advantage Regression: Aligning LLMs with Online AI Reward (2025)
- MPO: Multilingual Safety Alignment via Reward Gap Optimization (2025)