OLMo-2-7B-SFT-GRPO-MATH-1EPOCH

This model is a GRPO-fine-tuned version of allenai/OLMo-2-1124-7B-SFT trained on the MATH dataset.

This model is associated with the paper Learning to Reason without External Rewards, which introduces Intuitor, a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. This approach is built on a novel paradigm called Reinforcement Learning from Internal Feedback (RLIF), enabling models to learn without external rewards, gold labels, or verifiers by optimizing intrinsic signals.
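
For intuition, self-certainty can be thought of as a measure of how peaked the model's own next-token distributions are over its generated answer. The sketch below computes one such confidence score from per-step logits, taken as the average KL divergence of each next-token distribution from the uniform distribution; it is an illustrative approximation under that assumption, not necessarily the exact reward formula used by Intuitor or an API shipped with this repository.

import torch
import torch.nn.functional as F

# Illustrative self-certainty-style score (assumption: average KL divergence
# between the uniform distribution and each next-token distribution; a sketch,
# not the paper's verified implementation).
def self_certainty_score(logits: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) per-step logits for a generated sequence
    log_probs = F.log_softmax(logits, dim=-1)
    vocab_size = logits.size(-1)
    # KL(U || p) per step = -log(V) - (1/V) * sum_j log p_j
    kl_per_step = -torch.log(torch.tensor(float(vocab_size))) - log_probs.mean(dim=-1)
    # Higher values mean more peaked (more "confident") distributions
    return kl_per_step.mean()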


Project Page & Code

Usage

You can load and use this model with the transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH"

# It's recommended to load with bfloat16 for OLMo-2 models if supported by your hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Example usage:
prompt = "Question: What is 2 + 2?
Answer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
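
Because the base model is an SFT checkpoint, its tokenizer may ship a chat template; if it does, formatting prompts through the template is usually closer to the training distribution than a raw string. A minimal sketch, assuming the template is present and reusing the model and tokenizer loaded above:

# Format the prompt with the tokenizer's chat template (if available)
messages = [{"role": "user", "content": "What is 2 + 2?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))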

Citation

@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}