Qwen2.5-3B-Intuitor-MATH-1EPOCH

This model is an Intuitor-fine-tuned version of Qwen2.5-3B, trained on the MATH dataset, as presented in the paper "Learning to Reason without External Rewards".

Introduction

Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm we call Reinforcement Learning from Internal Feedback (RLIF).

Reinforcement Learning from Internal Feedback (RLIF) is a training framework where language models learn without any external rewards, gold labels, or verifiers. Instead, models improve by optimizing intrinsic signals—such as confidence in their own answers—generated entirely from within. RLIF enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable.

Intuitor instantiates RLIF by using self-certainty, the model's confidence measured as the average KL divergence between its next-token distributions and a uniform distribution over the vocabulary, as the intrinsic reward in the GRPO policy optimization algorithm.
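
As a concrete (and simplified) illustration, the sketch below computes a self-certainty score from next-token logits, assuming the score is the token-averaged KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution; see the paper for the exact definition used during training.

import math
import torch

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Token-averaged KL(U || p) for one response (a sketch, not the
    project's implementation).

    logits: (num_tokens, vocab_size) next-token logits.
    Returns a scalar score: 0 for uniform predictions, larger as the
    model's next-token distributions become more peaked (confident).
    """
    vocab_size = logits.shape[-1]
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    # KL(U || p) = -log(V) - (1/V) * sum_v log p_v
    per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return per_token.mean()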

For more details, see the project's GitHub repository.
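
To sketch how self-certainty plugs into GRPO (a hedged illustration, not the repository's code): GRPO samples a group of responses per prompt and converts each response's reward, here its self-certainty score, into a group-relative advantage.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) self-certainty scores for G responses sampled
    for the same prompt. Returns group-normalized advantages."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

These advantages then weight GRPO's clipped policy-gradient objective, so the whole loop runs without gold labels or an external verifier.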

Usage

You can use this model with the Hugging Face transformers library.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sunblaze-ucb/Qwen2.5-3B-Intuitor-MATH-1EPOCH"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16, depending on your GPU
    device_map="auto",
)

# The model is fine-tuned on MATH, so a math problem is a natural prompt.
messages = [
    {"role": "user", "content": "What is the sum of the first 100 positive integers?"},
]

# Build the chat-formatted prompt expected by Qwen2.5 models.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # pass input_ids and attention_mask together
    max_new_tokens=512,  # leave room for step-by-step reasoning
    temperature=0.7,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
output = tokenizer.decode(
    generated_ids[0][model_inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(output)
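
If you want to look at the signal Intuitor optimizes, a hedged sketch follows: it reuses the self_certainty helper from the Introduction and asks generate to return per-step scores. Note that under sampling, out.scores holds the logits after generate's processors (here, temperature scaling) rather than the raw model logits, which is sufficient for a rough reading.

with torch.no_grad():
    out = model.generate(
        **model_inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        return_dict_in_generate=True,
        output_scores=True,
    )

# out.scores is a tuple of (batch, vocab_size) tensors, one per step.
step_logits = torch.stack(out.scores, dim=1)[0]  # (num_new_tokens, vocab_size)
print("self-certainty:", self_certainty(step_logits).item())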

Citation

@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}

Model details

Base model: Qwen/Qwen2.5-3B
Parameters: 3.4B
Tensor type: BF16 (Safetensors)