PromptCoT-2.0-SelfPlay-4B

This model is part of PromptCoT 2.0 (Scaling Prompt Synthesis for LLM Reasoning).
It is a 4B model trained via self-play, where synthesized problems from PromptCoT 2.0 provide verifiable feedback (unit tests for code, boxed answers for math).
The training loop uses Direct Preference Optimization (DPO) to align generations with automatically verified outcomes, removing the dependence on stronger external teachers.
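For reference, the training loop optimizes the standard DPO objective (Rafailov et al., 2023), with a verified-correct response as the preferred completion $y_w$ and a failed one as the dispreferred completion $y_l$:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_{\mathrm{ref}}$ is a frozen reference policy and $\beta$ controls how far the trained policy may drift from it.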

This model establishes new state-of-the-art performance at the 4B scale, consistently outperforming strong open-source baselines and curated datasets.
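A minimal inference sketch (assuming standard Hugging Face `transformers` chat usage; the prompt and generation settings below are illustrative, not prescribed by the release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "xl-zhao/PromptCoT-2.0-SelfPlay-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Math prompts conventionally ask for the final answer in \boxed{},
# matching the verification signal used during self-play.
messages = [{"role": "user", "content": "Solve x^2 - 5x + 6 = 0. Put the final answer in \\boxed{}."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=4096)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```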


✨ Highlights

  • Self-Play Training:
    The model improves autonomously on synthetic math and code problems generated by PromptCoT 2.0.
    Positive/negative preference pairs are constructed from verifiable feedback signals (unit-test success for code, final-answer correctness for math); a construction sketch follows this list.

  • Strong Baseline Improvements:
    Outperforms Qwen3-4B-Thinking-2507 and surpasses training on curated datasets such as OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3 across all six benchmarks.
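A minimal sketch of the pair construction (the helper names, the plain string-match answer check, and the subprocess test runner are illustrative simplifications, not the released pipeline):

```python
import re
import subprocess
import sys
import tempfile
from dataclasses import dataclass

def check_boxed_answer(generation: str, reference: str) -> bool:
    """Math feedback: compare the last \\boxed{...} in the generation with
    the reference answer (plain string match; a real pipeline would
    normalize equivalent expressions)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", generation)
    return bool(matches) and matches[-1].strip() == reference.strip()

def passes_unit_tests(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Code feedback: a rollout counts as positive iff its program exits
    cleanly when executed together with the problem's unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # rollout that passed verification
    rejected: str  # rollout that failed verification

def build_dpo_pairs(prompt: str, rollouts: list[str], verify) -> list[PreferencePair]:
    """Split sampled rollouts for one synthesized problem by the verifier's
    verdict and pair every success with every failure."""
    passed = [r for r in rollouts if verify(r)]
    failed = [r for r in rollouts if not verify(r)]
    return [PreferencePair(prompt, c, r) for c in passed for r in failed]
```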


📊 Results

Evaluation on six benchmarks under the self-play setting with 4B parameters.
Bold = best, Italic = second-best.

| Model | AIME 24 | AIME 25 | HMMT Feb 25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces |
|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 85.2 | 81.3 | 55.5 | 63.8 | 55.2 | 1852 |
| OpenCodeReasoning | 83.1 | 78.5 | 50.4 | 64.4 | *57.1* | 1867 |
| OpenMathReasoning | *85.3* | *83.0* | 56.8 | 59.7 | 48.5 | 1826 |
| OpenThoughts3 | 84.7 | 80.6 | 54.2 | *65.2* | 54.4 | 1846 |
| OpenR1 | 84.6 | 80.9 | 56.7 | 63.0 | 54.6 | 1829 |
| PromptCoT 1.0 | *85.3* | 81.8 | *58.6* | 64.5 | 56.7 | *1878* |
| PromptCoT 2.0 | **87.3** | **85.0** | **66.5** | **67.7** | **61.1** | **1934** |

🔮 Key Takeaways

  • Best across all six benchmarks: PromptCoT 2.0 achieves top scores on AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, and Codeforces.
  • Large gains on high-difficulty tasks: +11.0 points on HMMT Feb 25, +5.9 on LiveCodeBench v6, and +82 Elo on Codeforces over the Qwen3-4B-Thinking-2507 baseline.
  • Beyond curated baselines: Unlike OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3, which saturate on strong 4B bases, PromptCoT 2.0 continues to deliver significant improvements.

📂 Resources

  • Paper: https://arxiv.org/abs/2509.19894

📜 Citation

If you find this model useful, please consider citing:

@article{zhao2025promptcot2,
  title     = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author    = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal   = {arXiv preprint arXiv:2509.19894},
  year      = {2025},
  url       = {https://arxiv.org/abs/2509.19894}
}