# PromptCoT-2.0-SFT-7B
This model is part of PromptCoT 2.0 (Scaling Prompt Synthesis for LLM Reasoning).
It is a 7B parameter model trained entirely on synthetic prompts generated by PromptCoT 2.0, with reasoning trajectories distilled from GPT-OSS-120B (medium).
Unlike prior work such as OpenMathReasoning and OpenCodeReasoning, which relies on human-written prompts, this model demonstrates that fully synthetic data can match or even surpass manually curated datasets for advancing reasoning in both mathematics and programming.
## Comparison
PromptCoT-2.0-SFT-7B is trained 100% on synthetic prompts with teacher trajectories from GPT-OSS-120B (medium).
Below we compare it against two widely used baselines trained on human-written prompts.
Metric: Pass@1 for AIME24/25, HMMT Feb25, LiveCodeBench v5/v6; Elo for Codeforces.
| Model | Prompt Source | Teacher | AIME24 | AIME25 | HMMT Feb25 | LiveCodeBench v5 (2408-2502) | LiveCodeBench v6 (2502-2505) | Codeforces |
|---|---|---|---|---|---|---|---|---|
| PromptCoT-2.0-SFT-7B | Synthetic | GPT-OSS-120B (med.) | 73.1 | 65.6 | 46.5 | 53.4 | 48.9 | 1815 |
| OpenMathReasoning | Human | DeepSeek-R1 | 73.3 | 58.1 | 42.1 | 9.7 | 10.7 | 676 |
| OpenCodeReasoning | Human | DeepSeek-R1 | 11.7 | 7.7 | 6.0 | 50.5 | 42.0 | 1648 |
### Takeaways
- Fully synthetic wins: PromptCoT-2.0-SFT-7B matches or outperforms the human-prompt baselines on the math benchmarks and outperforms them on all code benchmarks.
- Scalable & practical: High performance without manual prompt curation suggests a clear path to scaling reasoning with synthetic data.
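The table's Pass@1 numbers are typically computed with the Codex-style unbiased pass@k estimator (the exact sampling setup is not specified here); a minimal sketch of that estimator, which at k = 1 reduces to the fraction of correct samples:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    the probability that at least one of k completions,
    drawn from n total samples of which c are correct, passes."""
    if n - c < k:
        # Fewer incorrect samples than k draws: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# At k = 1 this is simply c / n.
print(pass_at_k(16, 8, 1))  # -> 0.5
```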
## Usage
You can load the model with Hugging Face `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "xl-zhao/PromptCoT-2.0-SFT-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Generate a reasoning trace for a simple math prompt.
prompt = "Solve for x: If 2x + 5 = 17, what is the value of x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
- Data: 4.8M fully synthetic prompts generated by PromptCoT 2.0
- Teacher: GPT-OSS-120B (medium), used for reasoning trajectory distillation
- Domains: Mathematics (Olympiad-level) and Programming (competitive coding)
- Training regime: Supervised fine-tuning (SFT), 100% synthetic data
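The SFT setup pairs each synthetic prompt with its distilled teacher trajectory. A minimal sketch of what one training record might look like as JSON Lines (the field names and layout here are illustrative assumptions, not the released data schema):

```python
import json

# Hypothetical record layout for prompt/trajectory SFT pairs;
# the actual PromptCoT 2.0 data schema may differ.
record = {
    "prompt": "A synthetic olympiad-level problem generated by PromptCoT 2.0",
    "response": "A reasoning trajectory distilled from the teacher, ending in a final answer",
    "domain": "math",  # "math" or "code"
    "teacher": "GPT-OSS-120B (medium)",
}

# SFT corpora are commonly stored as JSON Lines: one record per line.
line = json.dumps(record, ensure_ascii=False)
restored = json.loads(line)
assert restored == record
print(line[:60])
```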
## Key Insights
- Fully synthetic prompts work: No reliance on human-written datasets.
- Compact trajectories: Distilled responses are shorter than those in prior datasets, reducing inference cost while maintaining quality.
- Scalability: Opens the door for training larger reasoning models on purely synthetic corpora.
## Citation
If you use this model or the PromptCoT 2.0 dataset, please cite:
```bibtex
@article{zhao2025promptcot2,
  title   = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal = {arXiv preprint arXiv:2509.19894},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.19894}
}
```