Accelerating RL for LLM Reasoning with Optimal Advantage Regression Paper • 2505.20686 • Published 18 days ago
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF Paper • 2405.21046 • Published May 31, 2024