Improve model card: Add metadata, paper link and code link

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +11 -3
README.md CHANGED
@@ -1,3 +1,11 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: question-answering
+ ---
+
+ This model is presented in the paper [SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild](https://huggingface.co/papers/2503.18892).
+
+ This repo contains a simple reinforcement learning recipe to improve models' reasoning abilities. It is simple because only rule-based rewards and the GSM8K/MATH datasets are used. We have used this code to successfully train 10 diverse base models with limited data (8K examples), achieving surprisingly strong results: accuracy gains range from 10 to more than 20 absolute points. These models include Llama3 8B, Mistral 7B/24B, DeepSeekMath 7B, Qwen2.5 0.5B/1.5B/7B/14B/32B, and Qwen2.5-Math-7B. While we observe a significant increase in both response length and accuracy, we note that different models exhibit distinct reasoning behaviors during training, and increased response length does not necessarily correlate with the emergence of certain cognitive behaviors such as self-verification. We share many findings and practices in our paper, and we release the code, model checkpoints, and analysis tools here.
+
+ Code: https://github.com/hkust-nlp/simpleRL-reason/tree/v1