princetonu/sac-mad-aesac-aemad-gym8-kappas-maxalpha6-10seed-2p5m
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Continual Harness: Online Adaptation for Self-Improving Foundation Agents