---
license: mit
language:
- en
---
# PromptCoT-2.0-SelfPlay-4B

This model is part of **PromptCoT 2.0** (*Scaling Prompt Synthesis for LLM Reasoning*).
It is a **4B model trained via self-play**, where synthesized problems from PromptCoT 2.0 provide **verifiable feedback** (unit tests for code, boxed answers for math).
The training loop uses **Direct Preference Optimization (DPO)** to align generations with automatically verified outcomes, removing the dependence on stronger external teachers.

This model establishes **new state-of-the-art performance at the 4B scale**, consistently outperforming strong open-source baselines and curated datasets.

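To make the feedback signals above concrete, here is a minimal sketch of the two kinds of verification described: a boxed-answer check for math and a unit-test run for code. The helper names, the regex, and the subprocess harness are illustrative assumptions for this card, not the released PromptCoT 2.0 training code.

```python
# Illustrative sketch of verifiable feedback signals; not the released PromptCoT 2.0 code.
import re
import subprocess
import sys
import tempfile
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the last \\boxed{...} answer in a math solution, if any (nested braces not handled)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def verify_math(solution: str, reference_answer: str) -> bool:
    """A math solution counts as correct when its boxed answer matches the reference."""
    answer = extract_boxed(solution)
    return answer is not None and answer == reference_answer.strip()


def verify_code(program: str, unit_tests: str, timeout_s: int = 10) -> bool:
    """A code solution counts as correct when it passes the problem's unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```
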
---

## ✨ Highlights

- **Self-Play Training**:
  The model improves autonomously using **synthetic math & code problems** generated by PromptCoT 2.0.
  Positive/negative pairs are constructed from verifiable feedback signals (unit test success / final answer correctness); a minimal sketch of this pairing step follows the list below.

- **Strong Baseline Improvements**:
  Outperforms **Qwen3-4B-Thinking-2507** and surpasses curated datasets such as **OpenMathReasoning**, **OpenCodeReasoning**, and **OpenThoughts3** across all six benchmarks.

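Under the same caveats as the sketch above, the pairing step might look as follows: verified and failed samples for the same prompt are paired into the `prompt`/`chosen`/`rejected` records that DPO-style trainers consume. The function name and toy example are assumptions for illustration only.

```python
# Illustrative pairing step; layout follows the common prompt/chosen/rejected DPO format.
from typing import Callable, Dict, List


def build_preference_pairs(
    prompt: str,
    candidates: List[str],
    is_correct: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Pair each verified candidate with a failed one for DPO-style training."""
    chosen = [c for c in candidates if is_correct(c)]
    rejected = [c for c in candidates if not is_correct(c)]
    return [
        {"prompt": prompt, "chosen": pos, "rejected": neg}
        for pos, neg in zip(chosen, rejected)
    ]


# Toy example: four sampled solutions checked against a known boxed answer.
samples = ["... \\boxed{42}", "... \\boxed{41}", "... \\boxed{42}", "no final answer"]
pairs = build_preference_pairs(
    "Solve the problem ...", samples, is_correct=lambda s: "\\boxed{42}" in s
)
print(len(pairs))  # -> 2 preference pairs
```

Records in this shape can be fed to an off-the-shelf DPO implementation (for example, TRL's `DPOTrainer` consumes `prompt`/`chosen`/`rejected` columns); the exact recipe used for the released checkpoint may differ.
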
---

## 📊 Results

Evaluation on six benchmarks under the **self-play setting with 4B parameters**.
**Bold = best**, *Italic = second-best*.

| Model | AIME 24 | AIME 25 | HMMT Feb 25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces (Elo) |
|------------------------------|---------|---------|-------------|-------------------------------|-------------------------------|------------------|
| Qwen3-4B-Thinking-2507 | 85.2 | 81.3 | 55.5 | 63.8 | 55.2 | 1852 |
| OpenCodeReasoning | 83.1 | 78.5 | 50.4 | 64.4 | *57.1* | 1867 |
| OpenMathReasoning | *85.3* | *83.0* | 56.8 | 59.7 | 48.5 | 1826 |
| OpenThoughts3 | 84.7 | 80.6 | 54.2 | *65.2* | 54.4 | 1846 |
| OpenR1 | 84.6 | 80.9 | 56.7 | 63.0 | 54.6 | 1829 |
| PromptCoT 1.0 | *85.3* | 81.8 | *58.6* | 64.5 | 56.7 | *1878* |
| **PromptCoT 2.0** | **87.3** | **85.0** | **66.5** | **67.7** | **61.1** | **1934** |

---

## 🔮 Key Takeaways

* **Best across all six benchmarks**: PromptCoT 2.0 achieves the top score on AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, and Codeforces.
* **Large gains on high-difficulty tasks**: +11.0 points on HMMT Feb 25, +5.9 on LiveCodeBench v6, and +82 Elo on Codeforces over the Qwen3-4B-Thinking-2507 baseline.
* **Beyond curated baselines**: Unlike OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3, which saturate on strong 4B bases, PromptCoT 2.0 continues to deliver significant improvements.

---

## 📂 Resources

* 📄 Paper: [PromptCoT 2.0](https://arxiv.org/abs/2509.19894)
* 💻 GitHub: [inclusionAI/PromptCoT](https://github.com/inclusionAI/PromptCoT)
* 📊 Dataset: [PromptCoT-2.0-SelfPlay-4B-48K](https://huggingface.co/datasets/xl-zhao/PromptCoT-2.0-SelfPlay-4B-48K)

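---

## 🚀 Quick Start

A minimal inference sketch with 🤗 Transformers. The repository id, the chat-template call, and the sampling settings below are assumptions based on common Hugging Face conventions, not official usage instructions.

```python
# Minimal inference sketch; the repo id and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "xl-zhao/PromptCoT-2.0-SelfPlay-4B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {
        "role": "user",
        "content": "Find the remainder when 7^100 is divided by 13. "
                   "Put the final answer in \\boxed{}.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
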
---

## 📜 Citation

If you find this model useful, please consider citing:

```bibtex
@article{zhao2025promptcot2,
  title   = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal = {arXiv preprint arXiv:2509.19894},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.19894}
}
```