# Qwen2.5-3B-Intuitor-MATH-1EPOCH

This model is an Intuitor-fine-tuned version of Qwen2.5-3B trained on the MATH dataset, as presented in the paper *Learning to Reason without External Rewards*.
## Introduction
Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call Reinforcement Learning from Internal Feedback (RLIF).
Reinforcement Learning from Internal Feedback (RLIF) is a training framework where language models learn without any external rewards, gold labels, or verifiers. Instead, models improve by optimizing intrinsic signals—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.
Intuitor instantiates RLIF by using self-certainty as the intrinsic reward in the GRPO policy optimization algorithm; self-certainty measures the model's confidence as the KL divergence between its next-token distribution and a uniform distribution over the vocabulary.
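As a rough illustration of this intrinsic reward, the sketch below scores one generated sequence by the per-position KL divergence between a uniform distribution over the vocabulary and the model's predicted next-token distribution, averaged across generated tokens. The function name and exact formulation here are illustrative assumptions drawn from the description above, not the reference implementation from the repository.

```python
import math

import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative self-certainty score for one generated sequence.

    logits: tensor of shape (seq_len, vocab_size) holding the model's
    logits at each generated position.
    Returns the mean over positions of KL(U || p), where U is the uniform
    distribution over the vocabulary and p is the model's next-token
    distribution at that position.
    """
    log_probs = F.log_softmax(logits, dim=-1)  # log p(token | prefix) per position
    vocab_size = logits.size(-1)
    # KL(U || p) = -log|V| - (1/|V|) * sum_j log p_j, computed per position
    kl_per_position = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_position.mean()
```

Sequences on which the model places sharper (more confident) distributions yield a larger divergence from uniform and thus a higher score, which GRPO can then use as its group-relative reward signal.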
For more details, see the project's GitHub repository.
## Usage
You can use this model with the Hugging Face `transformers` library.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16, depending on your GPU
    device_map="auto",
)

# Build a chat-formatted prompt
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response (passing the full inputs keeps the attention mask)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
)

# Decode only the newly generated tokens
output = tokenizer.decode(
    generated_ids[0][model_inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(output)
```
## Citation
```bibtex
@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}
```