Intuitor
Collection
Models in the paper "Learning to Reason without External Rewards"
Description:
A GRPO-fine-tuned version of allenai/OLMo-2-1124-7B-SFT, trained on the MATH dataset with a system prompt.
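
For readers unfamiliar with GRPO (Group Relative Policy Optimization), the sketch below illustrates its core idea: several completions are sampled for the same prompt, and each completion's reward is normalized against the rest of its group to form an advantage. The function name, the binary correctness rewards, and the choice of population standard deviation are illustrative assumptions, not details taken from this model's training code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation is used here; real
    implementations may use the sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one MATH problem:
# 1.0 if the final answer matches the reference, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive positive advantages and the rest receive negative ones, so no separate value model is needed to provide a baseline.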
@article{zhao2025learning,
title={Learning to Reason without External Rewards},
author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
journal={arXiv preprint arXiv:2505.19590},
year={2025}
}