PromptCoT 2.0 - Prompt Model (pθ)
This is the Prompt Model (pθ) from the PromptCoT 2.0 implementation, trained with an Expectation-Maximization (EM) algorithm to generate challenging mathematical problems from concepts and rationales.
Model Details
Model Description
This model is part of a dual-model system implementing PromptCoT 2.0:
- pθ (Prompt Model): Generates problem x given concepts c and rationale z → p(x|z,c)
- qφ (Rationale Model): Generates rationale z given concepts c and problem x → q(z|c,x)
The models are trained iteratively using an EM loop (formalized briefly below):
- E-step: Generate K=8 rationale candidates, compute rewards, select best
- M-step: Fine-tune both models on selected (concept, rationale, problem) triples
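Concretely, in the notation above (a sketch consistent with the reward defined under Training Hyperparameters; the exact formulation in the paper may differ):
- E-step: z* = argmax_{k=1..K} R(c, x, z_k), with z_k sampled from q(z|c,x) and R(c,x,z) = log p(x|z,c) + log p(z|c)
- M-step: fine-tune pθ to maximize log p(x|z*,c) and qφ to maximize log q(z*|c,x) on the selected (c, z*, x) triples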
- Developed by: Krzysztof Staroń
- Model type: LoRA fine-tuned Causal Language Model
- Language(s): English (mathematical reasoning)
- License: Apache 2.0 (inherited from Qwen2.5-7B-Instruct)
- Finetuned from: Qwen/Qwen2.5-7B-Instruct
Model Sources
- Base Model: Qwen/Qwen2.5-7B-Instruct
- Paper: PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning (arXiv:2509.19894)
- Authors: Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, Lingpeng Kong
- Related Model: PromptCoT2.0
Uses
Direct Use
This model is designed to generate challenging mathematical problems given:
- Input format: Concepts: c1 | c2 | ...\nRationale: [rationale text]\nProblem:
- Output: Mathematical problem text
Example:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "PanzerBread/promptcot-p")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
concepts = "algebra | quadratic equations"
# The prompt ends at "Rationale:", so the model first generates a rationale and then a problem after "Problem:"
prompt = f"Concepts: {concepts}\nRationale:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
problem = tokenizer.decode(outputs[0], skip_special_tokens=True)
Downstream Use
This model is part of the PromptCoT 2.0 EM training loop. Use it together with the rationale model (qφ) to (see the sketch after this list):
- Generate synthetic training data for mathematical reasoning
- Improve problem-solving capabilities through iterative refinement
- Create challenging problem sets for educational purposes
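A minimal batch-generation sketch for synthetic data creation. The output file name, concept lists, and sampling settings below are illustrative assumptions, not part of the released pipeline:
import json
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "PanzerBread/promptcot-p")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
# Illustrative concept lists; replace with your own.
concept_sets = [
    "algebra | quadratic equations",
    "number theory | modular arithmetic",
]
with open("synthetic_problems.jsonl", "w") as f:  # hypothetical output path
    for concepts in concept_sets:
        prompt = f"Concepts: {concepts}\nRationale:"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Keep only the text after the "Problem:" marker, if the model emitted one.
        f.write(json.dumps({"concepts": concepts, "problem": text.split("Problem:")[-1].strip()}) + "\n")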
Out-of-Scope Use
This model is specialized for mathematical reasoning and may not perform well for:
- General conversational tasks
- Non-mathematical problem generation
- Tasks requiring external knowledge beyond mathematical concepts
Bias, Risks, and Limitations
Known Limitations
- Domain Specificity: This model is trained specifically for mathematical reasoning and may not generalize well to other domains
- Training Data Bias: The model inherits biases from the seed dataset (AIME, GSM8K, Math500), which may reflect specific mathematical problem styles
- EM Convergence: The EM algorithm may converge to local optima, depending on initialization and hyperparameters
- Generated Quality: Generated problems may require manual validation for correctness and appropriateness
Recommendations
Users should:
- Validate Outputs: Always verify generated problems for mathematical correctness
- Use with Rationale Model: This model works best when paired with the rationale model (qφ) in the full EM loop
- Monitor Training: Check WandB logs for reward trends and training stability
- Iterative Refinement: The EM process requires multiple iterations for best results
How to Get Started with the Model
Installation
pip install transformers peft torch
Loading the Model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-7B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "PanzerBread/promptcot-p")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
Generating Problems
concepts = "algebra | quadratic equations | factoring"
rationale = "To solve this problem, we need to factor the quadratic equation and find its roots..."
prompt = f"Concepts: {concepts}\nRationale: {rationale}\nProblem:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
do_sample=True
)
problem = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(problem.split("Problem:")[-1].strip())
Training Details
Training Data
Seed Dataset:
- 253 concept-rationale-problem triples from:
- AIME 2024/2025
- GSM8K
- Math500
- Format: (concepts: List[str], rationale: str, problem: str); an illustrative example is shown below
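For illustration, a single triple in this format might look like the following (the content is a made-up example, not an actual seed entry):
# Hypothetical triple; real entries come from AIME 2024/2025, GSM8K, and Math500.
seed_example = {
    "concepts": ["algebra", "quadratic equations"],
    "rationale": "Combine both conditions into one quadratic and reason about its roots...",
    "problem": "Find all real values of x such that ...",
}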
Training Process (one iteration is sketched below):
- Cold Start: Warm-start both models via Maximum Likelihood Estimation (MLE) on the seed dataset
- EM Loop: Iterative refinement through 10 EM iterations
- Each iteration generates K=8 rationale candidates per problem
- Selects best candidate based on reward function
- Fine-tunes both models on selected triples
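A minimal sketch of one EM iteration under these settings; generate_rationales, compute_reward, and finetune_on are hypothetical helpers standing in for the actual training code:
K = 8  # rationale candidates per problem
def em_iteration(prompt_model, rationale_model, dataset):
    selected = []
    # E-step: sample K rationale candidates per (concepts, problem) pair,
    # score each with the reward, and keep the best.
    for concepts, problem in dataset:
        candidates = generate_rationales(rationale_model, concepts, problem, k=K)  # hypothetical helper
        best = max(candidates, key=lambda z: compute_reward(prompt_model, concepts, problem, z))  # hypothetical helper
        selected.append((concepts, best, problem))
    # M-step: fine-tune both models on the selected (concept, rationale, problem) triples.
    finetune_on(prompt_model, selected)     # hypothetical helper: maximizes log p(x|z,c)
    finetune_on(rationale_model, selected)  # hypothetical helper: maximizes log q(z|c,x)
    return selected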
Training Procedure
Preprocessing
- Tokenization: Left-padding, max_length=512 (EM loop) / 2048 (cold start)
- Format: Concepts: c1 | c2 | ...\nRationale: z\nProblem: x
- Loss: masked cross-entropy; only tokens after the "Problem:" keyword contribute (see the sketch below)
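A minimal sketch of this formatting and label masking; build_example is an illustrative name, and it assumes the prompt tokenization is a prefix of the full tokenization:
def build_example(tokenizer, concepts, rationale, problem, max_length=512):
    # Format: "Concepts: c1 | c2 | ...\nRationale: z\nProblem: x"
    prompt = f"Concepts: {' | '.join(concepts)}\nRationale: {rationale}\nProblem:"
    enc = tokenizer(prompt + " " + problem, truncation=True, max_length=max_length)
    prompt_len = len(tokenizer(prompt)["input_ids"])
    # Masked cross-entropy: -100 is ignored by the HF loss,
    # so only the tokens after "Problem:" contribute to training.
    labels = list(enc["input_ids"])
    cut = min(prompt_len, len(labels))
    labels[:cut] = [-100] * cut
    enc["labels"] = labels
    return enc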
Training Hyperparameters
- Training regime: bfloat16 mixed precision
- LoRA Configuration:
  - r = 64 (rank)
  - lora_alpha = 16
  - lora_dropout = 0.05
  - Target modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
- EM Loop:
- Batch size: 16
- K samples: 8 rationale candidates per problem
- Learning rate: 2e-5 (inferred from Trainer defaults)
- Epochs per M-step: 1
- Reward Function: R(c,x,z) = log p(x|z,c) + log p(z|c), where the log probabilities are computed as the negative cross-entropy loss (a computation sketch follows)
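A minimal sketch of how such log probabilities can be obtained from Hugging Face model outputs; sequence_logprob is an illustrative helper, and scoring both terms with the prompt model is an assumption, not necessarily what the training code does:
import torch
def sequence_logprob(model, tokenizer, prefix, target):
    # log-probability of `target` given `prefix`, recovered as -(mean CE) * number of scored tokens.
    # Assumes `prefix` tokenizes to a prefix of `prefix + target`.
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prefix + target, return_tensors="pt").input_ids.to(model.device)
    labels = full_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the target tokens
    with torch.no_grad():
        out = model(full_ids, labels=labels)
    return -(out.loss * (labels != -100).sum()).item()
def reward(prompt_model, tokenizer, concepts, problem, rationale):
    # R(c,x,z) = log p(x|z,c) + log p(z|c)
    log_p_x = sequence_logprob(prompt_model, tokenizer,
                               f"Concepts: {concepts}\nRationale: {rationale}\nProblem:", " " + problem)
    log_p_z = sequence_logprob(prompt_model, tokenizer, f"Concepts: {concepts}\nRationale:", " " + rationale)
    return log_p_x + log_p_z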
Speeds, Sizes, Times
- Model Size: ~7B parameters (base) + ~0.02B (LoRA adapters)
- Hardware: H200 GPU (141 GB VRAM)
- Training Time: ~X hours per EM iteration (depending on dataset size)
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Seed dataset: 253 triples (training/validation split if applicable)
- Generated data: Synthetic problems created during EM iterations
Metrics
- Reward Score: Average reward per iteration (R(c,x,z) = log p(x|z,c) + log p(z|c))
- Training Loss: Cross-entropy loss on selected triples
- Rationale Quality: Measured through reward-based selection
Results
Training progress is monitored via WandB:
- E-step reward statistics (avg, max, min)
- M-step training losses for both models
- Number of triples selected per iteration
Note: This is an ongoing training process. Final evaluation results will be updated upon completion of all EM iterations.
Summary
The model is trained using PromptCoT 2.0's EM algorithm, which iteratively improves both problem generation (pθ) and rationale generation (qφ) capabilities through reward-based selection.
Technical Specifications
Model Architecture and Objective
- Base Architecture: Qwen2.5-7B-Instruct (Transformer decoder)
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Objective: Causal language modeling with masked cross-entropy
- Task: Generate problem x given concepts c and rationale z
Compute Infrastructure
Hardware
- Training: NVIDIA H200 GPU (141 GB VRAM)
- Inference: Compatible with any GPU supporting bfloat16
Software
- Framework: PyTorch 2.0+
- Libraries:
- transformers
- peft (v0.17.1+)
- datasets
- wandb (for logging)
- CUDA: Compatible with CUDA 11.8+
Citation
If you use this model, please cite the PromptCoT 2.0 paper:
BibTeX:
@article{zhao2025promptcot2,
title={PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
author={Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
journal={arXiv preprint arXiv:2509.19894},
year={2025}
}
APA: Zhao, X., Wu, W., Guan, J., Gong, Z., & Kong, L. (2025). PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning. arXiv preprint arXiv:2509.19894.
Paper Link: https://arxiv.org/abs/2509.19894
Framework versions
- PEFT 0.17.1
- transformers 4.40.0+
- torch 2.0+