---
license: mit
language:
- en
---
# PromptCoT-2.0-SelfPlay-4B

This model is part of **PromptCoT 2.0** (*Scaling Prompt Synthesis for LLM Reasoning*).
It is a **4B model trained via self-play**, where synthesized problems from PromptCoT 2.0 provide **verifiable feedback** (unit tests for code, boxed answers for math).
The training loop uses **Direct Preference Optimization (DPO)** to align generations with automatically verified outcomes, removing the dependence on stronger external teachers.

This model establishes **new state-of-the-art performance at the 4B scale**, consistently outperforming strong open-source baselines and curated datasets.

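To make the feedback signals above concrete, here is a minimal sketch of the two kinds of verification described: a boxed-answer check for math and a unit-test run for code. The helper names, the regex, and the subprocess harness are illustrative assumptions for this card, not the released PromptCoT 2.0 training code.

```python
# Illustrative sketch of verifiable feedback signals; not the released PromptCoT 2.0 code.
import re
import subprocess
import sys
import tempfile
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the last \\boxed{...} answer in a math solution, if any (nested braces not handled)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def verify_math(solution: str, reference_answer: str) -> bool:
    """A math solution counts as correct when its boxed answer matches the reference."""
    answer = extract_boxed(solution)
    return answer is not None and answer == reference_answer.strip()


def verify_code(program: str, unit_tests: str, timeout_s: int = 10) -> bool:
    """A code solution counts as correct when it passes the problem's unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```
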
---

## ✨ Highlights

- **Self-Play Training**:
  The model improves autonomously using **synthetic math & code problems** generated by PromptCoT 2.0.
  Positive/negative pairs are constructed from verifiable feedback signals (unit test success / final answer correctness); a minimal sketch of this pairing step follows the list below.

- **Strong Baseline Improvements**:
  Outperforms **Qwen3-4B-Thinking-2507** and surpasses curated datasets such as **OpenMathReasoning**, **OpenCodeReasoning**, and **OpenThoughts3** across all six benchmarks.

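Under the same caveats as the sketch above, the pairing step might look as follows: verified and failed samples for the same prompt are paired into the `prompt`/`chosen`/`rejected` records that DPO-style trainers consume. The function name and toy example are assumptions for illustration only.

```python
# Illustrative pairing step; layout follows the common prompt/chosen/rejected DPO format.
from typing import Callable, Dict, List


def build_preference_pairs(
    prompt: str,
    candidates: List[str],
    is_correct: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Pair each verified candidate with a failed one for DPO-style training."""
    chosen = [c for c in candidates if is_correct(c)]
    rejected = [c for c in candidates if not is_correct(c)]
    return [
        {"prompt": prompt, "chosen": pos, "rejected": neg}
        for pos, neg in zip(chosen, rejected)
    ]


# Toy example: four sampled solutions checked against a known boxed answer.
samples = ["... \\boxed{42}", "... \\boxed{41}", "... \\boxed{42}", "no final answer"]
pairs = build_preference_pairs(
    "Solve the problem ...", samples, is_correct=lambda s: "\\boxed{42}" in s
)
print(len(pairs))  # -> 2 preference pairs
```

Records in this shape can be fed to an off-the-shelf DPO implementation (for example, TRL's `DPOTrainer` consumes `prompt`/`chosen`/`rejected` columns); the exact recipe used for the released checkpoint may differ.
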
---

## 📊 Results

Evaluation on six benchmarks under the **self-play setting with 4B parameters**.
**Bold = best**, *Italic = second-best*.

| Model | AIME 24 | AIME 25 | HMMT Feb 25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces (Elo) |
|------------------------------|---------|---------|-------------|-------------------------------|-------------------------------|------------------|
| Qwen3-4B-Thinking-2507 | 85.2 | 81.3 | 55.5 | 63.8 | 55.2 | 1852 |
| OpenCodeReasoning | 83.1 | 78.5 | 50.4 | 64.4 | *57.1* | 1867 |
| OpenMathReasoning | *85.3* | *83.0* | 56.8 | 59.7 | 48.5 | 1826 |
| OpenThoughts3 | 84.7 | 80.6 | 54.2 | *65.2* | 54.4 | 1846 |
| OpenR1 | 84.6 | 80.9 | 56.7 | 63.0 | 54.6 | 1829 |
| PromptCoT 1.0 | *85.3* | 81.8 | *58.6* | 64.5 | 56.7 | *1878* |
| **PromptCoT 2.0** | **87.3** | **85.0** | **66.5** | **67.7** | **61.1** | **1934** |

---

## 🔮 Key Takeaways

* **Best across all six benchmarks**: PromptCoT 2.0 achieves the top score on AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, and Codeforces.
* **Large gains on high-difficulty tasks**: +11.0 points on HMMT Feb 25, +5.9 on LiveCodeBench v6, and +82 Elo on Codeforces over the Qwen3-4B-Thinking-2507 baseline.
* **Beyond curated baselines**: Unlike OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3, which saturate on strong 4B bases, PromptCoT 2.0 continues to deliver significant improvements.

---

## 📂 Resources

* 📄 Paper: [PromptCoT 2.0](https://arxiv.org/abs/2509.19894)
* 💻 GitHub: [inclusionAI/PromptCoT](https://github.com/inclusionAI/PromptCoT)
* 📊 Dataset: [PromptCoT-2.0-SelfPlay-4B-48K](https://huggingface.co/datasets/xl-zhao/PromptCoT-2.0-SelfPlay-4B-48K)

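---

## 🚀 Quick Start

A minimal inference sketch with 🤗 Transformers. The repository id, the chat-template call, and the sampling settings below are assumptions based on common Hugging Face conventions, not official usage instructions.

```python
# Minimal inference sketch; the repo id and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "xl-zhao/PromptCoT-2.0-SelfPlay-4B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {
        "role": "user",
        "content": "Find the remainder when 7^100 is divided by 13. "
                   "Put the final answer in \\boxed{}.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
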
---

## 📜 Citation

If you find this model useful, please consider citing:

```bibtex
@article{zhao2025promptcot2,
  title   = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal = {arXiv preprint arXiv:2509.19894},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.19894}
}
```