---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
  - qwen2.5
  - ppo
  - rlhf
  - metamath
  - math
  - reasoning
  - verl
pipeline_tag: text-generation
---

# Qwen2.5-3B-UFO-1turn

This model is based on Qwen2.5-3B-Instruct and trained with PPO (Proximal Policy Optimization) on the MetaMathQA dataset for mathematical reasoning.

GitHub: https://github.com/lichengliu03/unary-feedback

Website: https://unary-feedback.github.io/

## Model Info

- Base model: Qwen/Qwen2.5-3B-Instruct
- Training method: PPO (full-parameter fine-tuning, not LoRA)
- Training data: MATH_MetaMathQA
- Training steps: 200
- Framework: VERL
- Tensor parallelism: 2 GPUs, distributed training
- Model size: ~6 GB

## Training Config

- Micro Batch Size: 1 per GPU
- PPO Mini Batch Size: 8
- Actor Learning Rate: auto
- Critic Learning Rate: auto
- KL Penalty: 0.001
- Clip Ratio: 0.2-0.28 (see the sketch below)
- Temperature: 1.0 (train), 0.5 (eval)
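
For reference, the clip ratio and KL penalty above enter the PPO objective roughly as follows. This is a minimal sketch, not the VERL implementation: the function name, tensor shapes, and KL estimator are illustrative assumptions, and the actual trainer operates on per-token log-probabilities with masking.

```python
import torch

def ppo_objective(logp_new, logp_old, logp_ref, advantages,
                  clip_low=0.2, clip_high=0.28, kl_coef=0.001):
    # Importance ratio between the updated policy and the policy
    # that generated the rollouts
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    # Clipped surrogate loss: take the pessimistic branch, negate
    # because optimizers minimize
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Penalize drift away from the frozen reference (base) model
    kl_penalty = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl_penalty
```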

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LichengLiu03/qwen2.5-3b-ppo-metamath-full")
model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Example math problem
prompt = "Solve this math problem: If a circle has a radius of 5cm, what is its area?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
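
Since the base model is an Instruct checkpoint, you may get cleaner answers by wrapping the prompt with the tokenizer's chat template. The variant below is an optional sketch that reuses `prompt`, `tokenizer`, and `model` from the block above and uses the eval temperature of 0.5 listed in the training config.

```python
# Optional: format the prompt with the Qwen2.5 chat template
messages = [{"role": "user", "content": prompt}]
chat_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(chat_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.5,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```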

## Features

This model is optimized for mathematical reasoning with PPO. Compared to the base model, it improves:

- ✅ Math problem understanding
- ✅ Logical reasoning accuracy
- ✅ Clarity of solution steps
- ✅ Calculation accuracy

## Technical Details

- Tensor parallel training: 2 GPUs, distributed
- Memory optimization: gradient checkpointing and mixed precision
- Reward modeling: based on MetaMathQA answer correctness and reasoning quality (see the sketch below)
- Policy optimization: PPO for stable training
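
To make the reward-modeling bullet concrete, here is a hypothetical rule-based correctness check of the kind commonly used for math RL. The function name and the answer-extraction heuristic are assumptions for illustration; the actual reward function used in training is not specified here.

```python
import re

def correctness_reward(response: str, gold_answer: str) -> float:
    # Hypothetical sketch: reward 1.0 if the last number in the model's
    # response matches the reference answer, else 0.0. The real reward may
    # also score reasoning quality, which is not modeled here.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer.strip() else 0.0
```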

## Limitations

- Mainly optimized for mathematical reasoning
- May not perform as well on general tasks
- Recommended for math, logic, and reasoning tasks

## License

This model is licensed under Apache 2.0.