nielsr (HF Staff) committed · verified
Commit 8dc4360 · 1 Parent(s): 5a9d72f

Improve model card: Add library, links, and usage example


This PR significantly enhances the model card for `sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH` by:

* Updating the main heading to reflect the full model ID: `sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH`.
* Adding `library_name: transformers` to the metadata, enabling the "how to use" widget and improving discoverability (a quick `pipeline` sketch is included below).
* Including descriptive `tags` such as `reinforcement-learning`, `llm`, `reasoning`, and `math` for better categorization.
* Providing an expanded model description based on the paper abstract and project details, giving users a better understanding of the model and the underlying "Intuitor" and "RLIF" frameworks.
* Adding explicit links to the paper, the project page, and the GitHub repository for easy access to source materials and code.
* Including a clear Python code snippet for sample usage with the `transformers` library, making it easier for users to get started with inference.

Please review and merge this PR to improve the model's visibility and usability on the Hugging Face Hub.
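
As part of this change, the card now declares `library_name: transformers`, so the model also loads through the high-level `pipeline` API. The snippet below is only an illustrative sketch for reviewers; the prompt and generation settings are placeholders, not taken from the card:

```python
# Illustrative sketch: with library_name set to "transformers", the model can be
# loaded through the standard text-generation pipeline.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH",
    torch_dtype=torch.bfloat16,  # placeholder setting; adjust to your hardware
    device_map="auto",
)

print(generator("Solve for x: x + 7 = 15. Show your steps.", max_new_tokens=64)[0]["generated_text"])
```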

Files changed (1)
  1. README.md (+66, -9)
README.md CHANGED
@@ -1,25 +1,82 @@
---
base_model: Qwen/Qwen3-14B
- license: apache-2.0
datasets:
- - math
+ - math
+ language:
+ - en
+ license: apache-2.0
metrics:
- - accuracy
+ - accuracy
pipeline_tag: text-generation
- language:
- - en
+ library_name: transformers
+ tags:
+ - reinforcement-learning
+ - llm
+ - reasoning
+ - math
---

- # Qwen/Qwen3-14B-GRPO-MATH-1EPOCH
+ # sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH
+
+ [📄 Paper](https://huggingface.co/papers/2505.19590) | [🌐 Project Page](https://sites.google.com/view/eagle-llm) | [💻 GitHub](https://github.com/sunblaze-ucb/intuitor)
+
+ **Description:**

- **Description:**
+ This model is a GRPO-fine-tuned version of Qwen3-14B, specifically trained on the MATH dataset. It is part of the **Intuitor** project, presented in the paper "Learning to Reason without External Rewards".

- A GRPO-fine-tuned version of Qwen3-14B trained on the MATH dataset.
+ **Intuitor** is a novel reinforcement learning method that leverages *self-certainty*—the model’s own internal confidence—as its sole reward signal to fine-tune large language models (LLMs). This approach falls under a new framework called **Reinforcement Learning from Internal Feedback (RLIF)**, which enables LLMs to learn effectively from intrinsic signals, circumventing the need for costly external rewards, gold labels, or verifiers. This makes RLIF a scalable and domain-agnostic alternative to traditional RL methods, particularly useful when verifiable rewards are unavailable.
+
+ This particular model demonstrates Intuitor's ability to match GRPO's performance on mathematical benchmarks while showing superior generalization to out-of-domain tasks like code generation, all without requiring gold solutions or test cases.
+
+ ---
+
+ ## Usage
+
+ You can use this model with the `transformers` library for text generation.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ model.eval()
+
+ # Example using a chat-like template, typical for instruction-tuned models like Qwen.
+ # Adjust prompt format as needed for your specific use case.
+ messages = [
+     {"role": "user", "content": "Question: Solve the following equation: $x + 7 = 15$. Show your steps. Answer:"}
+ ]
+
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     model_inputs.input_ids,
+     max_new_tokens=100,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+     eos_token_id=tokenizer.eos_token_id
+ )
+
+ generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
+ print(generated_text)
+ ```

---

## Citation

+ If you use Intuitor in your research, please cite our paper:
+
```bibtex
@article{zhao2025learning,
  title = {Learning to Reason without External Rewards},
@@ -27,4 +84,4 @@ A GRPO-fine-tuned version of Qwen3-14B trained on the MATH dataset.
  journal = {arXiv preprint arXiv:2505.19590},
  year = {2025}
}
- ```
+ ```
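
A note for reviewers on the *self-certainty* signal mentioned in the new card text: the exact reward formulation is defined in the paper (arXiv:2505.19590). The sketch below is only a rough illustration of the idea of scoring an answer by how far the model's token distributions are from uniform; it is not the training code behind this checkpoint.

```python
# Illustrative sketch only (not Intuitor's training code): score an answer by the mean
# KL divergence between a uniform distribution over the vocabulary and the model's
# next-token distributions. Peakier (more confident) predictions yield higher scores.
import torch
import torch.nn.functional as F

@torch.no_grad()
def confidence_score(model, tokenizer, prompt: str, answer: str) -> float:
    inputs = tokenizer(prompt + answer, return_tensors="pt").to(model.device)
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    # Distributions that predict each answer token (positions shifted by one).
    log_probs = F.log_softmax(logits[0, prompt_len - 1 : -1, :], dim=-1)
    vocab_size = log_probs.shape[-1]

    # KL(Uniform || p) = -log(V) - (1/V) * sum_v log p(v), averaged over answer tokens.
    kl_from_uniform = -torch.log(torch.tensor(float(vocab_size))) - log_probs.mean(dim=-1)
    return float(kl_from_uniform.mean())
```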