Abstract
With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, which avoids crowdworkers' confusion about the tension between the two and allows us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task: maximize the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through three rounds of fine-tuning with Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we fine-tuned Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
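The constrained formulation in the abstract (maximize reward subject to a cost constraint, solved with a Lagrangian multiplier that is adjusted during training) can be illustrated with a toy sketch. This is not the paper's training procedure: the scalar parameter `theta`, the quadratic `reward`, the linear `cost`, and the step sizes are all assumptions chosen only to show how the multiplier `lam` grows while the constraint is violated and shifts the balance toward safety.

```python
def reward(theta: float) -> float:
    # Hypothetical "helpfulness" objective, maximized at theta = 2.
    return -(theta - 2.0) ** 2


def cost(theta: float) -> float:
    # Hypothetical "harm" cost. The constraint is cost(theta) <= 0,
    # which here means theta <= 1.
    return theta - 1.0


def lagrangian_ascent(steps: int = 2000,
                      lr_theta: float = 0.05,
                      lr_lambda: float = 0.05) -> tuple[float, float]:
    """Solve max_theta min_{lam >= 0} reward(theta) - lam * cost(theta)
    with alternating gradient steps (a toy stand-in for policy updates)."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Ascent step on theta: d/dtheta [reward - lam * cost].
        grad_theta = -2.0 * (theta - 2.0) - lam * 1.0
        theta += lr_theta * grad_theta
        # Multiplier step: lam rises while the constraint is violated
        # (cost > 0), falls otherwise, and is projected back to lam >= 0.
        lam = max(0.0, lam + lr_lambda * cost(theta))
    return theta, lam


theta_star, lambda_star = lagrangian_ascent()
print(f"theta = {theta_star:.4f}, lambda = {lambda_star:.4f}")
```

In this toy problem the unconstrained optimum (theta = 2) violates the constraint, so the iterates settle near the constrained optimum theta = 1 with a multiplier near 2. The multiplier plays the role of the dynamically adjusted trade-off between the two objectives.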
Community
This is a step in the right direction. The problem with RLHF is the H component. Many different ideologies compete and get averaged out. Look at the left/right "culture war" and you can see how ineffective this is. Current world affairs showcase just how deep the divide is.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SALMON: Self-Alignment with Principle-Following Reward Models (2023)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment (2023)
- Improving Generalization of Alignment with Human Preferences through Group Invariant Learning (2023)
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (2023)
- Stabilizing RLHF through Advantage Model and Selective Rehearsal (2023)
Hi @mattbarr, librarian-bot is managed by HF Staff and aims to recommend additional papers to our users. You can find more information about it on the following page: https://huggingface.co/librarian-bots
Let me know if you need more explanation.
Best
Rom
Enhancing AI Safety with Safe RLHF: Balancing Helpfulness and Harmlessness
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/