sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH

Xuandong, nielsr (HF Staff) committed on Aug 13

Commit 9f26613 (verified) · 1 parent: 5a9d72f

Improve model card: Add library, links, and usage example (#1)

- Improve model card: Add library, links, and usage example (8dc4360ef59581ce229c4d7992a2a92a49e13eb6)

Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)

README.md +66 -9

@@ -1,25 +1,82 @@
 ---
 base_model: Qwen/Qwen3-14B
-license: apache-2.0
 datasets:
-  - math
 metrics:
-  - accuracy
 pipeline_tag: text-generation
-language:
-  - en
 ---
-# Qwen/Qwen3-14B-GRPO-MATH-1EPOCH
-**Description:**
-A GRPO-fine-tuned version of Qwen3-14B trained on the MATH dataset.
 ---
 ## Citation
 ```bibtex
 @article{zhao2025learning,
   title   = {Learning to Reason without External Rewards},
@@ -27,4 +84,4 @@ A GRPO-fine-tuned version of Qwen3-14B trained on the MATH dataset.
   journal = {arXiv preprint arXiv:2505.19590},
   year    = {2025}
 }
-```

 ---
 base_model: Qwen/Qwen3-14B
 datasets:
+- math
+language:
+- en
+license: apache-2.0
 metrics:
+- accuracy
 pipeline_tag: text-generation
+library_name: transformers
+tags:
+- reinforcement-learning
+- llm
+- reasoning
+- math
 ---
+# sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH
+[📄 Paper](https://huggingface.co/papers/2505.19590) | [🌐 Project Page](https://sites.google.com/view/eagle-llm) | [💻 GitHub](https://github.com/sunblaze-ucb/intuitor)
+**Description:**
+This model is a GRPO-fine-tuned version of Qwen3-14B, trained on the MATH dataset. It is part of the **Intuitor** project, presented in the paper "Learning to Reason without External Rewards".
+**Intuitor** is a reinforcement learning method that uses *self-certainty* (the model's own internal confidence) as its sole reward signal for fine-tuning large language models (LLMs). It belongs to a framework called **Reinforcement Learning from Internal Feedback (RLIF)**, in which LLMs learn from intrinsic signals rather than from costly external rewards, gold labels, or verifiers. This makes RLIF a scalable, domain-agnostic alternative to conventional RL methods, particularly when verifiable rewards are unavailable.
+According to the paper, Intuitor matches GRPO's performance on mathematical benchmarks while generalizing better to out-of-domain tasks such as code generation, without requiring gold solutions or test cases.
+---
+## Usage
+You can use this model with the `transformers` library for text generation.
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH"
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True
+)
+model.eval()
+# Example using a chat-like template, typical for instruction-tuned models like Qwen.
+# Adjust prompt format as needed for your specific use case.
+messages = [
+    {"role": "user", "content": "Question: Solve the following equation: $x + 7 = 15$. Show your steps. Answer:"}
+]
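+text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+# Pass input_ids and attention_mask together so generate() does not have to infer the mask.
+# max_new_tokens=100 is short for step-by-step math; raise it for longer reasoning chains.
+generated_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=100,
+    do_sample=True,
+    temperature=0.7,
+    top_p=0.9,
+    eos_token_id=tokenizer.eos_token_id
+)
+# Decode only the newly generated tokens, not the echoed prompt.
+new_tokens = generated_ids[0][model_inputs.input_ids.shape[1]:]
+generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
+print(generated_text)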
+```
 ---
 ## Citation
+If you use Intuitor in your research, please cite our paper:
 ```bibtex
 @article{zhao2025learning,
   title   = {Learning to Reason without External Rewards},
   journal = {arXiv preprint arXiv:2505.19590},
   year    = {2025}
 }
+```
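
The new description says Intuitor uses self-certainty as its only reward, and that this checkpoint was trained with GRPO. The following is a minimal sketch of how such a reward could feed a group-relative update. It rests on assumptions not stated in this commit: self-certainty is taken to be the average KL divergence from the uniform distribution to the model's next-token distribution, and advantages are normalized within a group of sampled completions using the population standard deviation. The function names and toy distributions are hypothetical; check the exact formulas against the paper and the GitHub repository.

```python
import math
from statistics import mean, pstdev


def self_certainty(token_distributions):
    """Average KL(U || p) over the generated tokens of one completion.

    token_distributions: one probability vector per generated token, each
    strictly positive and summing to 1 over the same vocabulary.
    (Assumed definition; not taken from this commit.)
    """
    total = 0.0
    for probs in token_distributions:
        vocab_size = len(probs)
        # KL(U || p) = sum_j (1/V) * log((1/V) / p_j) = -(1/V) * sum_j log(V * p_j)
        total += -sum(math.log(vocab_size * p) for p in probs) / vocab_size
    return total / len(token_distributions)


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three sampled completions over a 4-token vocabulary.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 2
confident = [[0.97, 0.01, 0.01, 0.01]] * 2
mixed = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]]

rewards = [self_certainty(c) for c in (uniform, confident, mixed)]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group the uniform completion scores zero and the peaked one scores highest, so only the peaked completion receives a positive advantage. A real implementation would compute the distributions from the model's logits (for example via log-softmax) rather than from explicit probability lists.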