Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290, May 29, 2023.
Training language models to follow instructions with human feedback. arXiv:2203.02155, Mar 4, 2022.
Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560, Dec 20, 2022.
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv:2305.14387, May 22, 2023.
ORPO: Monolithic Preference Optimization without Reference Model. arXiv:2403.07691, Mar 12, 2024.