PromptCoT-2.0-SelfPlay-4B

This model is part of PromptCoT 2.0 (Scaling Prompt Synthesis for LLM Reasoning).
It is a 4B model trained via self-play, where synthesized problems from PromptCoT 2.0 provide verifiable feedback (unit tests for code, boxed answers for math).
The training loop uses Direct Preference Optimization (DPO) to align generations with automatically verified outcomes, removing the dependence on stronger external teachers.
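For reference, the training loop optimizes the standard DPO objective (Rafailov et al., 2023), with a verified-correct response as the preferred completion $y_w$ and a failed one as the dispreferred completion $y_l$:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_{\mathrm{ref}}$ is a frozen reference policy and $\beta$ controls how far the trained policy may drift from it.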

This model establishes new state-of-the-art performance at the 4B scale, consistently outperforming strong open-source baselines and curated datasets.
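A minimal inference sketch (assuming standard Hugging Face `transformers` chat usage; the prompt and generation settings below are illustrative, not prescribed by the release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "xl-zhao/PromptCoT-2.0-SelfPlay-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Math prompts conventionally ask for the final answer in \boxed{},
# matching the verification signal used during self-play.
messages = [{"role": "user", "content": "Solve x^2 - 5x + 6 = 0. Put the final answer in \\boxed{}."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=4096)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```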


✨ Highlights

  • Self-Play Training:
    The model improves autonomously on synthetic math and code problems generated by PromptCoT 2.0.
    Positive/negative preference pairs are constructed from verifiable feedback signals (unit-test success for code, final-answer correctness for math); a construction sketch follows this list.

  • Strong Baseline Improvements:
    Outperforms Qwen3-4B-Thinking-2507 and surpasses training on curated datasets such as OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3 across all six benchmarks.
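A minimal sketch of the pair construction (the helper names, the plain string-match answer check, and the subprocess test runner are illustrative simplifications, not the released pipeline):

```python
import re
import subprocess
import sys
import tempfile
from dataclasses import dataclass

def check_boxed_answer(generation: str, reference: str) -> bool:
    """Math feedback: compare the last \\boxed{...} in the generation with
    the reference answer (plain string match; a real pipeline would
    normalize equivalent expressions)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", generation)
    return bool(matches) and matches[-1].strip() == reference.strip()

def passes_unit_tests(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Code feedback: a rollout counts as positive iff its program exits
    cleanly when executed together with the problem's unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # rollout that passed verification
    rejected: str  # rollout that failed verification

def build_dpo_pairs(prompt: str, rollouts: list[str], verify) -> list[PreferencePair]:
    """Split sampled rollouts for one synthesized problem by the verifier's
    verdict and pair every success with every failure."""
    passed = [r for r in rollouts if verify(r)]
    failed = [r for r in rollouts if not verify(r)]
    return [PreferencePair(prompt, c, r) for c in passed for r in failed]
```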


📊 Results

Evaluation on six benchmarks under the self-play setting with 4B parameters.
Bold = best, Italic = second-best.

| Model | AIME 24 | AIME 25 | HMMT Feb 25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces |
|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 85.2 | 81.3 | 55.5 | 63.8 | 55.2 | 1852 |
| OpenCodeReasoning | 83.1 | 78.5 | 50.4 | 64.4 | *57.1* | 1867 |
| OpenMathReasoning | *85.3* | *83.0* | 56.8 | 59.7 | 48.5 | 1826 |
| OpenThoughts3 | 84.7 | 80.6 | 54.2 | *65.2* | 54.4 | 1846 |
| OpenR1 | 84.6 | 80.9 | 56.7 | 63.0 | 54.6 | 1829 |
| PromptCoT 1.0 | *85.3* | 81.8 | *58.6* | 64.5 | 56.7 | *1878* |
| PromptCoT 2.0 | **87.3** | **85.0** | **66.5** | **67.7** | **61.1** | **1934** |

🔮 Key Takeaways

  • Best across all six benchmarks: PromptCoT 2.0 achieves top scores on AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, and Codeforces.
  • Large gains on high-difficulty tasks: +11.0 points on HMMT Feb 25, +5.9 on LiveCodeBench v6, and +82 Elo on Codeforces over the Qwen3-4B-Thinking-2507 baseline.
  • Beyond curated baselines: Unlike OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3, which saturate on strong 4B bases, PromptCoT 2.0 continues to deliver significant improvements.

📂 Resources

  • Paper: https://arxiv.org/abs/2509.19894

📜 Citation

If you find this model useful, please consider citing:

@article{zhao2025promptcot2,
  title     = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author    = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal   = {arXiv preprint arXiv:2509.19894},
  year      = {2025},
  url       = {https://arxiv.org/abs/2509.19894}
}