PromptCoT 2.0 - Prompt Model (pθ)

This is the Prompt Model (pθ) from the PromptCoT 2.0 implementation, trained with an Expectation-Maximization (EM) algorithm to generate challenging mathematical problems from concepts and rationales.

Model Details

Model Description

This model is part of a dual-model system implementing PromptCoT 2.0:

  • pθ (Prompt Model): generates a problem x given concepts c and a rationale z, i.e., pθ(x|z,c)
  • qφ (Rationale Model): generates a rationale z given concepts c and a problem x, i.e., qφ(z|c,x)

The models are trained iteratively using an EM loop:

  1. E-step: Generate K=8 rationale candidates per problem, compute rewards, and select the best candidate
  2. M-step: Fine-tune both models on the selected (concept, rationale, problem) triples (a schematic sketch follows the details below)
  • Developed by: Krzysztof Staroń
  • Model type: LoRA fine-tuned Causal Language Model
  • Language(s): English (mathematical reasoning)
  • License: Apache 2.0 (inherited from Qwen2.5-7B)
  • Finetuned from: Qwen/Qwen2.5-7B
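
A schematic sketch of one EM iteration is shown below; all helper names (sample_rationale, reward, train) are hypothetical placeholders, not the actual training code:

# Schematic EM iteration (helper functions are illustrative placeholders)
K = 8  # rationale candidates per problem

def em_iteration(p_theta, q_phi, seed_data):
    selected = []
    # E-step: sample K rationales per (concepts, problem) pair and keep the highest-reward one
    for concepts, problem in seed_data:
        candidates = [sample_rationale(q_phi, concepts, problem) for _ in range(K)]
        best = max(candidates, key=lambda z: reward(concepts, problem, z))
        selected.append((concepts, best, problem))
    # M-step: fine-tune both models on the selected triples
    train(p_theta, selected)  # objective: p_theta(problem | rationale, concepts)
    train(q_phi, selected)    # objective: q_phi(rationale | concepts, problem)
    return selected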


Uses

Direct Use

This model is designed to generate challenging mathematical problems from a list of concepts and a rationale:

  • Input format: Concepts: c1 | c2 | ...\nRationale: [rationale text]\nProblem:
  • Output: Mathematical problem text

Example:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "PanzerBread/promptcot-p")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

concepts = "algebra | quadratic equations"
# With no rationale supplied, the model first writes its own rationale, then a problem after "Problem:"
prompt = f"Concepts: {concepts}\nRationale:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
problem = tokenizer.decode(outputs[0], skip_special_tokens=True)

Downstream Use

This model is part of the PromptCoT 2.0 EM training loop. Use it together with the rationale model (qφ), as sketched after the list below, to:

  • Generate synthetic training data for mathematical reasoning
  • Improve problem-solving capabilities through iterative refinement
  • Create challenging problem sets for educational purposes
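
A rough sketch of pairing the two adapters on a single base model follows. The rationale-model adapter path and its prompt format are assumptions (only the prompt model is published at the address above); PEFT's multi-adapter API (load_adapter / set_adapter) switches between pθ and qφ without reloading the base weights:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# p_theta adapters (this model) plus q_phi adapters (placeholder path) on one base model
model = PeftModel.from_pretrained(base, "PanzerBread/promptcot-p", adapter_name="p_theta")
model.load_adapter("path/to/rationale-model-adapter", adapter_name="q_phi")

def generate(prompt, adapter):
    model.set_adapter(adapter)  # route the forward pass through the chosen LoRA weights
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    return tokenizer.decode(out[0], skip_special_tokens=True)

concepts = "geometry | circle theorems"

# p_theta: draft a rationale and a problem from the concepts alone
draft = generate(f"Concepts: {concepts}\nRationale:", adapter="p_theta")
problem = draft.split("Problem:")[-1].strip()

# q_phi: propose an alternative rationale for the drafted problem (prompt format assumed)
rationale = generate(f"Concepts: {concepts}\nProblem: {problem}\nRationale:", adapter="q_phi")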

Out-of-Scope Use

This model is specialized for mathematical reasoning and may not perform well for:

  • General conversational tasks
  • Non-mathematical problem generation
  • Tasks requiring external knowledge beyond mathematical concepts

Bias, Risks, and Limitations

Known Limitations

  • Domain Specificity: This model is trained specifically for mathematical reasoning and may not generalize well to other domains
  • Training Data Bias: The model inherits biases from the seed dataset (AIME, GSM8K, Math500), which may reflect specific mathematical problem styles
  • EM Convergence: The EM algorithm may converge to local optima, depending on initialization and hyperparameters
  • Generated Quality: Generated problems may require manual validation for correctness and appropriateness

Recommendations

Users should:

  1. Validate Outputs: Always verify generated problems for mathematical correctness
  2. Use with Rationale Model: This model works best when paired with the rationale model (qφ) in the full EM loop
  3. Monitor Training: Check WandB logs for reward trends and training stability
  4. Iterative Refinement: The EM process requires multiple iterations for best results

How to Get Started with the Model

Installation

pip install transformers peft torch

Loading the Model

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "PanzerBread/promptcot-p")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
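
Optionally, the LoRA weights can be folded into the base model for inference; this uses standard PEFT functionality and is not required:

# Optional: merge the LoRA adapters into the base weights for slightly faster inference
model = model.merge_and_unload()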

Generating Problems

concepts = "algebra | quadratic equations | factoring"
rationale = "To solve this problem, we need to factor the quadratic equation and find its roots..."

prompt = f"Concepts: {concepts}\nRationale: {rationale}\nProblem:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True
)

problem = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(problem.split("Problem:")[-1].strip())

Training Details

Training Data

Seed Dataset:

  • 253 concept-rationale-problem triples from:
    • AIME 2024/2025
    • GSM8K
    • Math500
  • Format: (concepts: List[str], rationale: str, problem: str)
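
For illustration, a single record in this format might look like the following (the concrete values are invented, not taken from the actual seed dataset):

# Illustrative seed triple; values are made up for demonstration only
example_triple = {
    "concepts": ["algebra", "quadratic equations", "factoring"],
    "rationale": "Combine a factoring step with an integer-root constraint so that "
                 "solving the problem requires reasoning about the discriminant.",
    "problem": "Find all integers k such that x^2 - kx + 12 = 0 has two distinct integer roots.",
}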

Training Process:

  1. Cold Start: Warm-start both models via Maximum Likelihood Estimation (MLE) on the seed dataset
  2. EM Loop: Iterative refinement through 10 EM iterations
    • Each iteration generates K=8 rationale candidates per problem
    • Selects best candidate based on reward function
    • Fine-tunes both models on selected triples

Training Procedure

Preprocessing

  • Tokenization: Left-padding, max_length=512 (EM loop) / 2048 (cold start)
  • Format: Concepts: c1 | c2 | ...\nRationale: z\nProblem: x
  • Loss: Masked cross-entropy; only the tokens after the "Problem:" marker contribute to the loss
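
A minimal sketch of this masking, assuming Hugging Face-style labels where positions set to -100 are ignored by the cross-entropy loss (the training code's exact implementation may differ):

# Keep loss only on tokens after the "Problem:" marker; input_ids is a 1-D torch tensor
def mask_labels(input_ids, tokenizer):
    labels = input_ids.clone()
    marker = tokenizer("Problem:", add_special_tokens=False)["input_ids"]
    ids = input_ids.tolist()
    # locate the last occurrence of the marker token sequence
    for i in range(len(ids) - len(marker), -1, -1):
        if ids[i:i + len(marker)] == marker:
            labels[: i + len(marker)] = -100  # ignored by the cross-entropy loss
            break
    return labels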

Training Hyperparameters

  • Training regime: bfloat16 mixed precision
  • LoRA Configuration:
    • r=64 (rank)
    • lora_alpha=16
    • lora_dropout=0.05
    • Target modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
  • EM Loop:
    • Batch size: 16
    • K samples: 8 rationale candidates per problem
    • Learning rate: 2e-5 (inferred from Trainer defaults)
    • Epochs per M-step: 1
  • Reward Function:
    R(c, x, z) = log p(x|z, c) + log p(z|c)

    where each log-probability is computed as the negative of the corresponding cross-entropy loss.
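
As an illustrative sketch of this reward with Hugging Face models: the log-probability of a continuation equals the negative of its summed cross-entropy loss. Scoring both terms under the prompt model and the exact summation convention are assumptions here, not the confirmed training code:

import torch

@torch.no_grad()
def sequence_log_prob(model, tokenizer, prompt, target):
    # log p(target | prompt) = -(cross-entropy summed over the target tokens)
    enc = tokenizer(prompt + " " + target, return_tensors="pt").to(model.device)
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    labels = enc["input_ids"].clone()
    labels[:, :prompt_len] = -100            # score only the target continuation
    out = model(**enc, labels=labels)
    n_target = (labels != -100).sum()
    return -(out.loss * n_target).item()     # HF loss is the mean CE over unmasked tokens

def reward(model, tokenizer, concepts, problem, rationale):
    # R(c, x, z) = log p(x|z, c) + log p(z|c), both terms scored under the prompt model here
    log_p_x = sequence_log_prob(
        model, tokenizer, f"Concepts: {concepts}\nRationale: {rationale}\nProblem:", problem
    )
    log_p_z = sequence_log_prob(
        model, tokenizer, f"Concepts: {concepts}\nRationale:", rationale
    )
    return log_p_x + log_p_z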

Speeds, Sizes, Times

  • Model Size: ~7B parameters (base) + ~0.02B (LoRA adapters)
  • Hardware: H200 GPU (141 GB VRAM)
  • Training Time: ~X hours per EM iteration (depending on dataset size)

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Seed dataset: 253 triples (training/validation split if applicable)
  • Generated data: Synthetic problems created during EM iterations

Metrics

  • Reward Score: Average reward per iteration (R(c,x,z) = log p(x|z,c) + log p(z|c))
  • Training Loss: Cross-entropy loss on selected triples
  • Rationale Quality: Measured through reward-based selection

Results

Training progress is monitored via WandB:

  • E-step reward statistics (avg, max, min)
  • M-step training losses for both models
  • Number of triples selected per iteration

Note: This is an ongoing training process. Final evaluation results will be updated upon completion of all EM iterations.

Summary

The model is trained using PromptCoT 2.0's EM algorithm, which iteratively improves both problem generation (pθ) and rationale generation (qφ) capabilities through reward-based selection.


Technical Specifications

Model Architecture and Objective

  • Base Architecture: Qwen2.5-7B-Instruct (Transformer decoder)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Objective: Causal language modeling with masked cross-entropy
  • Task: Generate problems x given concepts c and rationale z

Compute Infrastructure

Hardware

  • Training: NVIDIA H200 GPU (141 GB VRAM)
  • Inference: Compatible with any GPU supporting bfloat16

Software

  • Framework: PyTorch 2.0+
  • Libraries:
    • transformers
    • peft (v0.17.1+)
    • datasets
    • wandb (for logging)
  • CUDA: Compatible with CUDA 11.8+

Citation

If you use this model, please cite the PromptCoT 2.0 paper:

BibTeX:

@article{zhao2025promptcot2,
  title={PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author={Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2509.19894},
  year={2025}
}

APA: Zhao, X., Wu, W., Guan, J., Gong, Z., & Kong, L. (2025). PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning. arXiv preprint arXiv:2509.19894.

Paper Link: https://arxiv.org/abs/2509.19894

Framework versions

  • PEFT 0.17.1
  • transformers 4.40.0+
  • torch 2.0+