---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
  - qwen2.5
  - ppo
  - rlhf
  - metamath
  - math
  - reasoning
  - verl
pipeline_tag: text-generation
---

# Qwen2.5-3B-UFO-1turn

This model is based on Qwen2.5-3B-Instruct and trained with PPO (Proximal Policy Optimization) on the MetaMathQA dataset for mathematical reasoning.

GitHub: https://github.com/lichengliu03/unary-feedback

Website: https://unary-feedback.github.io/

## Model Info

- Base model: Qwen/Qwen2.5-3B-Instruct
- Training method: PPO (full-parameter fine-tuning, not LoRA)
- Training data: MATH_MetaMathQA
- Training steps: 200
- Framework: VERL
- Tensor parallelism: 2 GPUs, distributed training
- Model size: ~6 GB

## Training Config

- Micro Batch Size: 1 per GPU
- PPO Mini Batch Size: 8
- Actor Learning Rate: auto
- Critic Learning Rate: auto
- KL Penalty: 0.001
- Clip Ratio: 0.2-0.28 (see the sketch below)
- Temperature: 1.0 (train), 0.5 (eval)
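
For reference, the clip ratio and KL penalty above enter the PPO objective roughly as follows. This is a minimal sketch, not the VERL implementation: the function name, tensor shapes, and KL estimator are illustrative assumptions, and the actual trainer operates on per-token log-probabilities with masking.

```python
import torch

def ppo_objective(logp_new, logp_old, logp_ref, advantages,
                  clip_low=0.2, clip_high=0.28, kl_coef=0.001):
    # Importance ratio between the updated policy and the policy
    # that generated the rollouts
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    # Clipped surrogate loss: take the pessimistic branch, negate
    # because optimizers minimize
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Penalize drift away from the frozen reference (base) model
    kl_penalty = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl_penalty
```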

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LichengLiu03/qwen2.5-3b-ppo-metamath-full")
model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Example math problem
prompt = "Solve this math problem: If a circle has a radius of 5cm, what is its area?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
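
Since the base model is an Instruct checkpoint, you may get cleaner answers by wrapping the prompt with the tokenizer's chat template. The variant below is an optional sketch that reuses `prompt`, `tokenizer`, and `model` from the block above and uses the eval temperature of 0.5 listed in the training config.

```python
# Optional: format the prompt with the Qwen2.5 chat template
messages = [{"role": "user", "content": prompt}]
chat_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(chat_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.5,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```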

## Features

This model is optimized for mathematical reasoning with PPO. Compared to the base model, it improves:

- ✅ Math problem understanding
- ✅ Logical reasoning accuracy
- ✅ Clarity of solution steps
- ✅ Calculation accuracy

## Technical Details

- Tensor parallel training: 2 GPUs, distributed
- Memory optimization: gradient checkpointing and mixed precision
- Reward modeling: based on MetaMathQA answer correctness and reasoning quality (see the sketch below)
- Policy optimization: PPO for stable training
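
To make the reward-modeling bullet concrete, here is a hypothetical rule-based correctness check of the kind commonly used for math RL. The function name and the answer-extraction heuristic are assumptions for illustration; the actual reward function used in training is not specified here.

```python
import re

def correctness_reward(response: str, gold_answer: str) -> float:
    # Hypothetical sketch: reward 1.0 if the last number in the model's
    # response matches the reference answer, else 0.0. The real reward may
    # also score reasoning quality, which is not modeled here.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer.strip() else 0.0
```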

## Limitations

- Mainly optimized for mathematical reasoning
- May not perform as well on general tasks
- Recommended for math, logic, and reasoning tasks

## License

This model is licensed under Apache 2.0.