The Alignment Waltz: Jointly Training Agents to Collaborate for Safety (arXiv:2510.08240)
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense (arXiv:2510.07242)
RESTRAIN: From Spurious Votes to Signals — Self-Driven RL with Self-Penalization (arXiv:2510.02172)
Jointly Reinforcing Diversity and Quality in Language Model Generations (arXiv:2509.02534)