Xuandong and nielsr (HF Staff) committed
Commit 9f26613 · verified · 1 Parent(s): 5a9d72f

Improve model card: Add library, links, and usage example (#1)


- Improve model card: Add library, links, and usage example (8dc4360ef59581ce229c4d7992a2a92a49e13eb6)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +66 -9
README.md CHANGED
@@ -1,25 +1,82 @@
  ---
  base_model: Qwen/Qwen3-14B
- license: apache-2.0
  datasets:
- - math
+ - math
+ language:
+ - en
+ license: apache-2.0
  metrics:
- - accuracy
+ - accuracy
  pipeline_tag: text-generation
- language:
- - en
+ library_name: transformers
+ tags:
+ - reinforcement-learning
+ - llm
+ - reasoning
+ - math
  ---

- # Qwen/Qwen3-14B-GRPO-MATH-1EPOCH
+ # sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH
+
+ [📄 Paper](https://huggingface.co/papers/2505.19590) | [🌐 Project Page](https://sites.google.com/view/eagle-llm) | [💻 GitHub](https://github.com/sunblaze-ucb/intuitor)
+
+ **Description:**

- **Description:**
+ This model is a GRPO-fine-tuned version of Qwen3-14B, specifically trained on the MATH dataset. It is part of the **Intuitor** project, presented in the paper "Learning to Reason without External Rewards".

- A GRPO-fine-tuned version of Qwen3-14B trained on the MATH dataset.
+ **Intuitor** is a novel reinforcement learning method that leverages *self-certainty*—the model’s own internal confidence—as its sole reward signal to fine-tune large language models (LLMs). This approach falls under a new framework called **Reinforcement Learning from Internal Feedback (RLIF)**, which enables LLMs to learn effectively from intrinsic signals, circumventing the need for costly external rewards, gold labels, or verifiers. This makes RLIF a scalable and domain-agnostic alternative to traditional RL methods, particularly useful when verifiable rewards are unavailable.
+
+ This particular model demonstrates Intuitor's ability to match GRPO's performance on mathematical benchmarks while showing superior generalization to out-of-domain tasks like code generation, all without requiring gold solutions or test cases.
+
+ ---
+
+ ## Usage
+
+ You can use this model with the `transformers` library for text generation.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ model.eval()
+
+ # Example using a chat-like template, typical for instruction-tuned models like Qwen.
+ # Adjust prompt format as needed for your specific use case.
+ messages = [
+     {"role": "user", "content": "Question: Solve the following equation: $x + 7 = 15$. Show your steps. Answer:"}
+ ]
+
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     model_inputs.input_ids,
+     max_new_tokens=100,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+     eos_token_id=tokenizer.eos_token_id
+ )
+
+ generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
+ print(generated_text)
+ ```

  ---

  ## Citation

+ If you use Intuitor in your research, please cite our paper:
+
  ```bibtex
  @article{zhao2025learning,
    title = {Learning to Reason without External Rewards},
@@ -27,4 +84,4 @@ A GRPO-fine-tuned version of Qwen3-14B trained on the MATH dataset.
    journal = {arXiv preprint arXiv:2505.19590},
    year = {2025}
  }
- ```
+ ```
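
For context on how this particular checkpoint was trained: GRPO samples a group of responses per prompt, scores each one with an external reward (on MATH, typically a check of the final answer against the reference solution), and standardizes each reward against the group's mean and standard deviation instead of learning a value network. The snippet below is a minimal, illustrative sketch of that group-relative advantage step only; the function name, tensor layout, and the binary-correctness reward are assumptions for illustration, not the actual training code.

```python
import torch

def grpo_group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt.

    rewards: shape (group_size,), one scalar reward per sampled response
             (e.g., 1.0 if the final answer matches the reference, else 0.0).
    Each response's advantage is its reward standardized against the group,
    so no separate value/critic network is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 sampled solutions to one MATH problem, 3 of them correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
print(grpo_group_advantages(rewards))  # positive for correct samples, negative otherwise
```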
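
The updated description also summarizes Intuitor's intrinsic reward, *self-certainty*. As a rough sketch only (assuming the common formulation of self-certainty as the KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, averaged over the generated tokens), the following shows how such a score could be computed from the policy's logits; the function name and tensor shapes are illustrative, and the GitHub repository linked above is the authoritative reference for the reward actually used.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Per-sequence self-certainty score from next-token logits.

    logits:        (batch, seq_len, vocab_size) logits the policy assigned while
                   generating the response.
    response_mask: (batch, seq_len) with 1.0 on response tokens, 0.0 elsewhere.

    For each token, KL(U || p) = -log(V) - mean_v log p_v, where U is uniform
    over the vocabulary; a peaked (confident) next-token distribution gives a
    larger value. Averaging this quantity over the response tokens yields the
    intrinsic score used as the reward, with no gold answer or verifier involved.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits.float(), dim=-1)                  # (B, T, V)
    kl_from_uniform = -math.log(vocab_size) - log_probs.mean(dim=-1)   # (B, T)
    token_counts = response_mask.sum(dim=-1).clamp(min=1.0)
    return (kl_from_uniform * response_mask).sum(dim=-1) / token_counts  # (B,)
```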