---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- ppo
- rlhf
- metamath
- math
- reasoning
- verl
pipeline_tag: text-generation
---

# Qwen2.5-3B-UFO-1turn

This model is based on **Qwen2.5-3B-Instruct** and trained with **PPO (Proximal Policy Optimization)** on the **MetaMathQA** dataset for mathematical reasoning.

GitHub: https://github.com/lichengliu03/unary-feedback

Website: https://unary-feedback.github.io/

## Model Info

- **Base model**: Qwen/Qwen2.5-3B-Instruct
- **Training method**: PPO (full-parameter fine-tuning, not LoRA)
- **Training data**: MATH_MetaMathQA
- **Training steps**: 200
- **Framework**: VERL
- **Parallelism**: tensor-parallel training across 2 GPUs
- **Model size**: ~6 GB

## Training Config

- **Micro Batch Size**: 1 per GPU
- **PPO Mini Batch Size**: 8
- **Actor Learning Rate**: auto
- **Critic Learning Rate**: auto
- **KL Penalty**: 0.001
- **Clip Ratio**: 0.2–0.28
- **Temperature**: 1.0 (train), 0.5 (eval)

The clip ratio and KL penalty are the two standard PPO regularizers; a reference statement of the objective is given at the end of this card.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LichengLiu03/qwen2.5-3b-ppo-metamath-full")
model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Example math problem
prompt = "Solve this math problem: If a circle has a radius of 5 cm, what is its area?"
# Move inputs to the model's device (needed when loading with device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate an answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

A chat-formatted variant of this example is sketched at the end of this card.

## Features

This model is optimized for mathematical reasoning with PPO. Compared to the base model, it improves:

- ✅ Math problem understanding
- ✅ Logical reasoning accuracy
- ✅ Clarity of solution steps
- ✅ Calculation accuracy

## Technical Details

- **Tensor-parallel training**: distributed across 2 GPUs
- **Memory optimization**: gradient checkpointing and mixed-precision training
- **Reward modeling**: based on answer correctness and reasoning quality on MetaMathQA (an illustrative sketch of this pattern appears at the end of this card)
- **Policy optimization**: PPO for stable policy updates

## Limitations

- Optimized primarily for mathematical reasoning
- May not perform as well on general-purpose tasks
- Recommended for math, logic, and reasoning tasks

## License

This model is licensed under Apache 2.0.
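
## PPO Objective (Reference)

The Clip Ratio and KL Penalty listed under Training Config correspond to the two regularizers in the PPO objective. For reference, the standard clipped surrogate loss (a generic statement, not a VERL-specific formula) is

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
$$

where $\epsilon$ is the clip ratio (0.2–0.28 here) and $\hat{A}_t$ is the advantage estimate. The KL penalty adds a term $-\beta\,\mathrm{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$ with $\beta = 0.001$ to the reward, discouraging the policy from drifting far from the reference model.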
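
## Chat-Format Inference

Since the base model is instruction-tuned, applying the Qwen chat template may produce better-structured answers than a raw prompt. The following is a minimal sketch using the standard `transformers` chat-template API; the generation settings (e.g., `temperature=0.5`, matching the eval temperature above) are illustrative, not prescribed by the training recipe.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "LichengLiu03/qwen2.5-3b-ppo-metamath-full"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Wrap the problem as a user turn so the instruct chat template is applied
messages = [
    {"role": "user", "content": "If a circle has a radius of 5 cm, what is its area?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_new_tokens=512,
        temperature=0.5,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (skip the prompt)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```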
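
## Reward Function Sketch

The card states that the reward is based on answer correctness on MetaMathQA; the exact implementation is not documented here. A common pattern for math RL training is a rule-based check that compares the final answer in a response against the reference. The helper below is hypothetical and purely illustrative of that pattern; see the GitHub repository for the reward actually used.

```python
import re

def correctness_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the last number in the
    response matches the reference answer, else 0.0. Illustrative only;
    not the reward implementation used for this model."""
    # Pull out all number-like tokens, ignoring thousands separators
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not matches:
        return 0.0
    # Treat the last number in the response as the model's final answer
    return 1.0 if matches[-1] == gold_answer.strip() else 0.0
```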