---
base_model: Qwen/Qwen3-14B
datasets:
  - math
language:
  - en
license: apache-2.0
metrics:
  - accuracy
pipeline_tag: text-generation
library_name: transformers
tags:
  - reinforcement-learning
  - llm
  - reasoning
  - math
---

sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH

📄 Paper | 🌐 Project Page | 💻 GitHub

Description

This model is a GRPO-fine-tuned version of Qwen/Qwen3-14B, trained for one epoch on the MATH dataset. It was released as part of the Intuitor project, presented in the paper "Learning to Reason without External Rewards".

Intuitor is a novel reinforcement learning method that leverages self-certainty—the model’s own internal confidence—as its sole reward signal to fine-tune large language models (LLMs). This approach falls under a new framework called Reinforcement Learning from Internal Feedback (RLIF), which enables LLMs to learn effectively from intrinsic signals, circumventing the need for costly external rewards, gold labels, or verifiers. This makes RLIF a scalable and domain-agnostic alternative to traditional RL methods, particularly useful when verifiable rewards are unavailable.
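For intuition, a self-certainty-style score can be computed from the model's own logits alone. The sketch below is illustrative only, not the training code; the exact formulation and normalization follow the paper:

import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    # Illustrative: mean KL(U || p) over output positions, where U is the
    # uniform distribution over the vocabulary and p is the model's
    # next-token distribution. Larger values mean the distribution is
    # farther from uniform, i.e. the model is more confident.
    # logits: [seq_len, vocab_size] logits for the generated tokens.
    log_probs = F.log_softmax(logits, dim=-1)
    vocab_size = logits.size(-1)
    # KL(U || p) = mean_j(-log p_j) - log |V|
    kl_per_position = (-log_probs).mean(dim=-1) - math.log(vocab_size)
    return kl_per_position.mean()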

In the paper, this GRPO-trained model serves as a baseline for Intuitor: Intuitor matches GRPO's performance on mathematical benchmarks while generalizing better to out-of-domain tasks such as code generation, all without requiring gold solutions or test cases.
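Since this checkpoint was trained with GRPO, the core of that objective is a group-relative advantage: rewards for several responses sampled from the same prompt are normalized within the group. A minimal sketch, assuming a simple correctness reward on MATH (illustrative only; reward details and hyperparameters follow the paper and the original GRPO work):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [group_size] scalar rewards for responses to the same prompt.
    # GRPO-style advantage: normalize rewards within the group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled solutions to one MATH problem, rewarded 1.0 if the
# final answer matches the reference and 0.0 otherwise (illustrative reward).
print(group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))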


Usage

You can use this model with the transformers library for text generation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
model.eval()

# Qwen3 is instruction-tuned, so format the prompt with its chat template.
# Adjust the prompt as needed for your use case.
messages = [
    {"role": "user", "content": "Solve the following equation and show your steps: $x + 7 = 15$"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # pass input_ids and attention_mask together
    max_new_tokens=512,  # reasoning traces can be long; raise if answers get cut off
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, not the prompt.
new_tokens = generated_ids[0][model_inputs.input_ids.shape[1]:]
generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(generated_text)
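Qwen3's default chat template typically accepts an enable_thinking flag. Assuming this fine-tune keeps the base model's template (an assumption, not verified here), you can toggle the reasoning block like this:

# Assumption: the checkpoint inherits Qwen3's chat template, which accepts
# `enable_thinking`. False requests a direct answer without a <think> block.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)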

Citation

If you use Intuitor in your research, please cite our paper:

@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}