PromptCoT 2.0 - Prompt Model (pθ)
This is the Prompt Model (pθ) from the PromptCoT 2.0 implementation, trained with an Expectation-Maximization (EM) algorithm to generate challenging mathematical problems from concepts and rationales.
Model Details
Model Description
This model is part of a dual-model system implementing PromptCoT 2.0:
- pθ (Prompt Model): Generates problem x given concepts c and rationale z → p(x|z,c)
- qφ (Rationale Model): Generates rationale z given concepts c and problem x → q(z|c,x)
The models are trained iteratively using an EM loop (formalized briefly below):
- E-step: Generate K=8 rationale candidates, compute rewards, select best
- M-step: Fine-tune both models on selected (concept, rationale, problem) triples
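Concretely, in the notation above (a sketch consistent with the reward defined under Training Hyperparameters; the exact formulation in the paper may differ):
- E-step: z* = argmax_{k=1..K} R(c, x, z_k), with z_k sampled from q(z|c,x) and R(c,x,z) = log p(x|z,c) + log p(z|c)
- M-step: fine-tune pθ to maximize log p(x|z*,c) and qφ to maximize log q(z*|c,x) on the selected (c, z*, x) triples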
- Developed by: Krzysztof Staroń
- Model type: LoRA fine-tuned Causal Language Model
- Language(s): English (mathematical reasoning)
- License: Apache 2.0 (inherited from Qwen2.5-7B-Instruct)
- Finetuned from: Qwen/Qwen2.5-7B-Instruct
Model Sources
- Base Model: Qwen/Qwen2.5-7B-Instruct
- Paper: PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning (arXiv:2509.19894)
- Authors: Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, Lingpeng Kong
- Related Model: PromptCoT2.0
Uses
Direct Use
This model is designed to generate challenging mathematical problems given:
- Input format: Concepts: c1 | c2 | ...\nRationale: [rationale text]\nProblem:
- Output: Mathematical problem text
Example:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "PanzerBread/promptcot-p")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
concepts = "algebra | quadratic equations"
# The prompt ends at "Rationale:", so the model first generates a rationale and then a problem after "Problem:"
prompt = f"Concepts: {concepts}\nRationale:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
problem = tokenizer.decode(outputs[0], skip_special_tokens=True)
Downstream Use
This model is part of the PromptCoT 2.0 EM training loop. Use it together with the rationale model (qφ) to (see the sketch after this list):
- Generate synthetic training data for mathematical reasoning
- Improve problem-solving capabilities through iterative refinement
- Create challenging problem sets for educational purposes
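A minimal batch-generation sketch for synthetic data creation. The output file name, concept lists, and sampling settings below are illustrative assumptions, not part of the released pipeline:
import json
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "PanzerBread/promptcot-p")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
# Illustrative concept lists; replace with your own.
concept_sets = [
    "algebra | quadratic equations",
    "number theory | modular arithmetic",
]
with open("synthetic_problems.jsonl", "w") as f:  # hypothetical output path
    for concepts in concept_sets:
        prompt = f"Concepts: {concepts}\nRationale:"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Keep only the text after the "Problem:" marker, if the model emitted one.
        f.write(json.dumps({"concepts": concepts, "problem": text.split("Problem:")[-1].strip()}) + "\n")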
Out-of-Scope Use
This model is specialized for mathematical reasoning and may not perform well for:
- General conversational tasks
- Non-mathematical problem generation
- Tasks requiring external knowledge beyond mathematical concepts
Bias, Risks, and Limitations
Known Limitations
- Domain Specificity: This model is trained specifically for mathematical reasoning and may not generalize well to other domains
- Training Data Bias: The model inherits biases from the seed dataset (AIME, GSM8K, Math500), which may reflect specific mathematical problem styles
- EM Convergence: The EM algorithm may converge to local optima, depending on initialization and hyperparameters
- Generated Quality: Generated problems may require manual validation for correctness and appropriateness
Recommendations
Users should:
- Validate Outputs: Always verify generated problems for mathematical correctness
- Use with Rationale Model: This model works best when paired with the rationale model (qφ) in the full EM loop
- Monitor Training: Check WandB logs for reward trends and training stability
- Iterative Refinement: The EM process requires multiple iterations for best results
How to Get Started with the Model
Installation
pip install transformers peft torch
Loading the Model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-7B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "PanzerBread/promptcot-p")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
Generating Problems
concepts = "algebra | quadratic equations | factoring"
rationale = "To solve this problem, we need to factor the quadratic equation and find its roots..."
prompt = f"Concepts: {concepts}\nRationale: {rationale}\nProblem:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
do_sample=True
)
problem = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(problem.split("Problem:")[-1].strip())
Training Details
Training Data
Seed Dataset:
- 253 concept-rationale-problem triples from:
- AIME 2024/2025
- GSM8K
- Math500
- Format: (concepts: List[str], rationale: str, problem: str); an illustrative example is shown below
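For illustration, a single triple in this format might look like the following (the content is a made-up example, not an actual seed entry):
# Hypothetical triple; real entries come from AIME 2024/2025, GSM8K, and Math500.
seed_example = {
    "concepts": ["algebra", "quadratic equations"],
    "rationale": "Combine both conditions into one quadratic and reason about its roots...",
    "problem": "Find all real values of x such that ...",
}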
Training Process (one iteration is sketched below):
- Cold Start: Warm-start both models via Maximum Likelihood Estimation (MLE) on the seed dataset
- EM Loop: Iterative refinement through 10 EM iterations
- Each iteration generates K=8 rationale candidates per problem
- Selects best candidate based on reward function
- Fine-tunes both models on selected triples
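A minimal sketch of one EM iteration under these settings; generate_rationales, compute_reward, and finetune_on are hypothetical helpers standing in for the actual training code:
K = 8  # rationale candidates per problem
def em_iteration(prompt_model, rationale_model, dataset):
    selected = []
    # E-step: sample K rationale candidates per (concepts, problem) pair,
    # score each with the reward, and keep the best.
    for concepts, problem in dataset:
        candidates = generate_rationales(rationale_model, concepts, problem, k=K)  # hypothetical helper
        best = max(candidates, key=lambda z: compute_reward(prompt_model, concepts, problem, z))  # hypothetical helper
        selected.append((concepts, best, problem))
    # M-step: fine-tune both models on the selected (concept, rationale, problem) triples.
    finetune_on(prompt_model, selected)     # hypothetical helper: maximizes log p(x|z,c)
    finetune_on(rationale_model, selected)  # hypothetical helper: maximizes log q(z|c,x)
    return selected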
Training Procedure
Preprocessing
- Tokenization: Left-padding, max_length=512 (EM loop) / 2048 (cold start)
- Format: Concepts: c1 | c2 | ...\nRationale: z\nProblem: x
- Loss: masked cross-entropy; only tokens after the "Problem:" keyword contribute (see the sketch below)
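A minimal sketch of this formatting and label masking; build_example is an illustrative name, and it assumes the prompt tokenization is a prefix of the full tokenization:
def build_example(tokenizer, concepts, rationale, problem, max_length=512):
    # Format: "Concepts: c1 | c2 | ...\nRationale: z\nProblem: x"
    prompt = f"Concepts: {' | '.join(concepts)}\nRationale: {rationale}\nProblem:"
    enc = tokenizer(prompt + " " + problem, truncation=True, max_length=max_length)
    prompt_len = len(tokenizer(prompt)["input_ids"])
    # Masked cross-entropy: -100 is ignored by the HF loss,
    # so only the tokens after "Problem:" contribute to training.
    labels = list(enc["input_ids"])
    cut = min(prompt_len, len(labels))
    labels[:cut] = [-100] * cut
    enc["labels"] = labels
    return enc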
Training Hyperparameters
- Training regime: bfloat16 mixed precision
- LoRA Configuration:
  - r = 64 (rank)
  - lora_alpha = 16
  - lora_dropout = 0.05
  - Target modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
- EM Loop:
- Batch size: 16
- K samples: 8 rationale candidates per problem
- Learning rate: 2e-5 (inferred from Trainer defaults)
- Epochs per M-step: 1
- Reward Function: R(c,x,z) = log p(x|z,c) + log p(z|c), where the log probabilities are computed as the negative cross-entropy loss (a computation sketch follows)
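A minimal sketch of how such log probabilities can be obtained from Hugging Face model outputs; sequence_logprob is an illustrative helper, and scoring both terms with the prompt model is an assumption, not necessarily what the training code does:
import torch
def sequence_logprob(model, tokenizer, prefix, target):
    # log-probability of `target` given `prefix`, recovered as -(mean CE) * number of scored tokens.
    # Assumes `prefix` tokenizes to a prefix of `prefix + target`.
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prefix + target, return_tensors="pt").input_ids.to(model.device)
    labels = full_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the target tokens
    with torch.no_grad():
        out = model(full_ids, labels=labels)
    return -(out.loss * (labels != -100).sum()).item()
def reward(prompt_model, tokenizer, concepts, problem, rationale):
    # R(c,x,z) = log p(x|z,c) + log p(z|c)
    log_p_x = sequence_logprob(prompt_model, tokenizer,
                               f"Concepts: {concepts}\nRationale: {rationale}\nProblem:", " " + problem)
    log_p_z = sequence_logprob(prompt_model, tokenizer, f"Concepts: {concepts}\nRationale:", " " + rationale)
    return log_p_x + log_p_z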
Speeds, Sizes, Times
- Model Size: ~7B parameters (base) + ~0.02B (LoRA adapters)
- Hardware: H200 GPU (141 GB VRAM)
- Training Time: ~X hours per EM iteration (depending on dataset size)
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Seed dataset: 253 triples (training/validation split if applicable)
- Generated data: Synthetic problems created during EM iterations
Metrics
- Reward Score: Average reward per iteration (R(c,x,z) = log p(x|z,c) + log p(z|c))
- Training Loss: Cross-entropy loss on selected triples
- Rationale Quality: Measured through reward-based selection
Results
Training progress is monitored via WandB:
- E-step reward statistics (avg, max, min)
- M-step training losses for both models
- Number of triples selected per iteration
Note: This is an ongoing training process. Final evaluation results will be updated upon completion of all EM iterations.
Summary
The model is trained using PromptCoT 2.0's EM algorithm, which iteratively improves both problem generation (pθ) and rationale generation (qφ) capabilities through reward-based selection.
Technical Specifications
Model Architecture and Objective
- Base Architecture: Qwen2.5-7B-Instruct (Transformer decoder)
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Objective: Causal language modeling with masked cross-entropy
- Task: Generate problem x given concepts c and rationale z
Compute Infrastructure
Hardware
- Training: NVIDIA H200 GPU (141 GB VRAM)
- Inference: Compatible with any GPU supporting bfloat16
Software
- Framework: PyTorch 2.0+
- Libraries:
- transformers
- peft (v0.17.1+)
- datasets
- wandb (for logging)
- CUDA: Compatible with CUDA 11.8+
Citation
If you use this model, please cite the PromptCoT 2.0 paper:
BibTeX:
@article{zhao2025promptcot2,
title={PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
author={Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
journal={arXiv preprint arXiv:2509.19894},
year={2025}
}
APA: Zhao, X., Wu, W., Guan, J., Gong, Z., & Kong, L. (2025). PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning. arXiv preprint arXiv:2509.19894.
Paper Link: https://arxiv.org/abs/2509.19894
Framework versions
- PEFT 0.17.1
- transformers 4.40.0+
- torch 2.0+