Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training • Paper • 2509.03403 • Published 6 days ago • 20
Self-Rewarding Vision-Language Model via Reasoning Decomposition • Paper • 2508.19652 • Published 14 days ago • 82
Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback • Paper • 2310.11550 • Published Oct 17, 2023 • 1