nielsr (HF Staff) committed · verified
Commit 8dc4360 · 1 Parent(s): 5a9d72f

Improve model card: Add library, links, and usage example


This PR significantly enhances the model card for `sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH` by:

* Updating the main heading to reflect the full model ID: `sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH`.
* Adding `library_name: transformers` to the metadata, enabling the "how to use" widget and improving discoverability (a quick `pipeline` sketch is included below).
* Including descriptive `tags` such as `reinforcement-learning`, `llm`, `reasoning`, and `math` for better categorization.
* Providing an expanded model description based on the paper abstract and project details, giving users a better understanding of the model and the underlying "Intuitor" and "RLIF" frameworks.
* Adding explicit links to the paper, the project page, and the GitHub repository for easy access to source materials and code.
* Including a clear Python code snippet for sample usage with the `transformers` library, making it easier for users to get started with inference.

Please review and merge this PR to improve the model's visibility and usability on the Hugging Face Hub.
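
As part of this change, the card now declares `library_name: transformers`, so the model also loads through the high-level `pipeline` API. The snippet below is only an illustrative sketch for reviewers; the prompt and generation settings are placeholders, not taken from the card:

```python
# Illustrative sketch: with library_name set to "transformers", the model can be
# loaded through the standard text-generation pipeline.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH",
    torch_dtype=torch.bfloat16,  # placeholder setting; adjust to your hardware
    device_map="auto",
)

print(generator("Solve for x: x + 7 = 15. Show your steps.", max_new_tokens=64)[0]["generated_text"])
```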

Files changed (1)
  1. README.md (+66, -9)
README.md CHANGED
@@ -1,25 +1,82 @@
---
base_model: Qwen/Qwen3-14B
- license: apache-2.0
datasets:
- - math
+ - math
+ language:
+ - en
+ license: apache-2.0
metrics:
- - accuracy
+ - accuracy
pipeline_tag: text-generation
- language:
- - en
+ library_name: transformers
+ tags:
+ - reinforcement-learning
+ - llm
+ - reasoning
+ - math
---

- # Qwen/Qwen3-14B-GRPO-MATH-1EPOCH
+ # sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH
+
+ [📄 Paper](https://huggingface.co/papers/2505.19590) | [🌐 Project Page](https://sites.google.com/view/eagle-llm) | [💻 GitHub](https://github.com/sunblaze-ucb/intuitor)
+
+ **Description:**

- **Description:**
+ This model is a GRPO-fine-tuned version of Qwen3-14B, specifically trained on the MATH dataset. It is part of the **Intuitor** project, presented in the paper "Learning to Reason without External Rewards".

- A GRPO-fine-tuned version of Qwen3-14B trained on the MATH dataset.
+ **Intuitor** is a novel reinforcement learning method that leverages *self-certainty*—the model’s own internal confidence—as its sole reward signal to fine-tune large language models (LLMs). This approach falls under a new framework called **Reinforcement Learning from Internal Feedback (RLIF)**, which enables LLMs to learn effectively from intrinsic signals, circumventing the need for costly external rewards, gold labels, or verifiers. This makes RLIF a scalable and domain-agnostic alternative to traditional RL methods, particularly useful when verifiable rewards are unavailable.
+
+ This particular model demonstrates Intuitor's ability to match GRPO's performance on mathematical benchmarks while showing superior generalization to out-of-domain tasks like code generation, all without requiring gold solutions or test cases.
+
+ ---
+
+ ## Usage
+
+ You can use this model with the `transformers` library for text generation.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "sunblaze-ucb/Qwen3-14B-GRPO-MATH-1EPOCH"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ model.eval()
+
+ # Example using a chat-like template, typical for instruction-tuned models like Qwen.
+ # Adjust prompt format as needed for your specific use case.
+ messages = [
+     {"role": "user", "content": "Question: Solve the following equation: $x + 7 = 15$. Show your steps. Answer:"}
+ ]
+
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     model_inputs.input_ids,
+     max_new_tokens=100,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+     eos_token_id=tokenizer.eos_token_id
+ )
+
+ generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
+ print(generated_text)
+ ```

---

## Citation

+ If you use Intuitor in your research, please cite our paper:
+
```bibtex
@article{zhao2025learning,
  title = {Learning to Reason without External Rewards},
@@ -27,4 +84,4 @@ A GRPO-fine-tuned version of Qwen3-14B trained on the MATH dataset.
  journal = {arXiv preprint arXiv:2505.19590},
  year = {2025}
}
- ```
+ ```
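
A note for reviewers on the *self-certainty* signal mentioned in the new card text: the exact reward formulation is defined in the paper (arXiv:2505.19590). The sketch below is only a rough illustration of the idea of scoring an answer by how far the model's token distributions are from uniform; it is not the training code behind this checkpoint.

```python
# Illustrative sketch only (not Intuitor's training code): score an answer by the mean
# KL divergence between a uniform distribution over the vocabulary and the model's
# next-token distributions. Peakier (more confident) predictions yield higher scores.
import torch
import torch.nn.functional as F

@torch.no_grad()
def confidence_score(model, tokenizer, prompt: str, answer: str) -> float:
    inputs = tokenizer(prompt + answer, return_tensors="pt").to(model.device)
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

    logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    # Distributions that predict each answer token (positions shifted by one).
    log_probs = F.log_softmax(logits[0, prompt_len - 1 : -1, :], dim=-1)
    vocab_size = log_probs.shape[-1]

    # KL(Uniform || p) = -log(V) - (1/V) * sum_v log p(v), averaged over answer tokens.
    kl_from_uniform = -torch.log(torch.tensor(float(vocab_size))) - log_probs.mean(dim=-1)
    return float(kl_from_uniform.mean())
```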