---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- ppo
- rlhf
- metamath
- math
- reasoning
- verl
pipeline_tag: text-generation
---
# Qwen2.5-3B-UFO-1turn
This model is based on **Qwen2.5-3B-Instruct** and trained with **PPO (Proximal Policy Optimization)** on the **MetaMathQA** dataset for mathematical reasoning.
GitHub: https://github.com/lichengliu03/unary-feedback
Website: https://unary-feedback.github.io/
## Model Info
- **Base model**: Qwen/Qwen2.5-3B-Instruct
- **Training method**: PPO (full-parameter fine-tuning, not LoRA)
- **Training data**: MATH_MetaMathQA
- **Training steps**: 200
- **Framework**: VERL
- **Tensor parallelism**: 2 GPUs (distributed training)
- **Model size**: ~6 GB
## Training Config
- **Micro Batch Size**: 1 per GPU
- **PPO Mini Batch Size**: 8
- **Actor Learning Rate**: auto
- **Critic Learning Rate**: auto
- **KL Penalty**: 0.001
- **Clip Ratio**: 0.2-0.28
- **Temperature**: 1.0 (train), 0.5 (eval); see the sketch after this list
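As a rough illustration of how the train vs. eval sampling temperatures above could be applied at inference time (this is an assumption for demonstration, not a config exported from the training run; the names `train_sampling` and `eval_sampling` are illustrative), they can be written as `transformers` `GenerationConfig` objects:
```python
from transformers import GenerationConfig

# Illustrative mapping of the card's sampling settings to inference-time configs
train_sampling = GenerationConfig(do_sample=True, temperature=1.0, max_new_tokens=512)
eval_sampling = GenerationConfig(do_sample=True, temperature=0.5, max_new_tokens=512)

# Example: pass one of them to generate (model/tokenizer loaded as in the Usage section below)
# outputs = model.generate(**inputs, generation_config=eval_sampling)
```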
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LichengLiu03/qwen2.5-3b-ppo-metamath-full")
model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Example math problem
prompt = "Solve this math problem: If a circle has a radius of 5 cm, what is its area?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate an answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
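Since the base model is an instruct model, prompts formatted with the tokenizer's chat template may work better than raw text. The sketch below reuses the `model` and `tokenizer` loaded above and assumes the tokenizer ships with Qwen's default chat template:
```python
messages = [
    {"role": "user", "content": "If a circle has a radius of 5 cm, what is its area?"}
]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.5,  # evaluation temperature listed in the training config
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```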
## Features
This model is optimized for mathematical reasoning with PPO. Compared to the base model, it improves:
- βœ… Math problem understanding
- βœ… Logical reasoning accuracy
- βœ… Clarity of solution steps
- βœ… Calculation accuracy
## Technical Details
- **Tensor-parallel training**: distributed across 2 GPUs
- **Memory optimization**: gradient checkpointing and mixed precision
- **Reward modeling**: based on MetaMathQA correctness and reasoning quality (a minimal sketch follows this list)
- **Policy optimization**: PPO for stable training
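The exact reward function is not published here; the following is a minimal sketch of a correctness-style reward, assuming the final answer is extracted from a "The answer is:" span and compared by exact string match. The function name `correctness_reward` and the extraction pattern are illustrative assumptions, not the training code:
```python
import re

def correctness_reward(response: str, ground_truth: str) -> float:
    """Illustrative sketch: 1.0 if the extracted final answer matches, else 0.0."""
    match = re.search(r"The answer is:?\s*(.+)", response)
    prediction = match.group(1).strip() if match else response.strip()
    return 1.0 if prediction == ground_truth.strip() else 0.0

# Example usage with a hypothetical model response:
print(correctness_reward("... so the area is 25*pi. The answer is: 25\\pi", "25\\pi"))  # 1.0
```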
## Limitations
- Mainly optimized for mathematical reasoning
- May not perform as well on general tasks
- Recommended for math, logic, and reasoning tasks
## License
This model is licensed under Apache 2.0.