---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- ppo
- rlhf
- metamath
- math
- reasoning
- verl
pipeline_tag: text-generation
---

# Qwen2.5-3B-UFO-1turn

This model is based on **Qwen2.5-3B-Instruct** and trained with **PPO (Proximal Policy Optimization)** on the **MetaMathQA** dataset for mathematical reasoning.

GitHub: https://github.com/lichengliu03/unary-feedback

Website: https://unary-feedback.github.io/

## Model Info

- **Base model**: Qwen/Qwen2.5-3B-Instruct
- **Training method**: PPO (full-parameter fine-tuning, not LoRA)
- **Training data**: MATH_MetaMathQA
- **Training steps**: 200
- **Framework**: VERL
- **Tensor parallel**: distributed training across 2 GPUs
- **Model size**: ~6 GB (see the quick check below)

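Because this is a full-parameter checkpoint rather than a LoRA adapter, the ~6 GB figure follows directly from roughly 3B parameters stored in fp16 (about 2 bytes per parameter). If you want to verify this locally, a quick check along the following lines should work; it reuses the repo id from the Usage section below.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
)

# Count parameters and the corresponding fp16 weight footprint.
n_params = sum(p.numel() for p in model.parameters())
n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters: {n_params / 1e9:.2f}B")           # expected: roughly 3B
print(f"fp16 weights: {n_bytes / 1024 ** 3:.1f} GiB")  # expected: roughly 6 GB
```
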
## Training Config

- **Micro Batch Size**: 1 per GPU
- **PPO Mini Batch Size**: 8
- **Actor Learning Rate**: auto
- **Critic Learning Rate**: auto
- **KL Penalty**: 0.001
- **Clip Ratio**: 0.2-0.28 (see the sketch below)
- **Temperature**: 1.0 (train), 0.5 (eval)

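To make the KL penalty and clip ratio above concrete, here is a minimal sketch of the token-level PPO objective they parameterize. It assumes the 0.2-0.28 range denotes asymmetric lower/upper clipping bounds and uses a simple KL estimate against the frozen reference model; it is an illustration, not the exact loss computed by VERL.

```python
import torch

def ppo_loss(log_probs, old_log_probs, ref_log_probs, advantages,
             clip_low=0.2, clip_high=0.28, kl_coef=0.001):
    """Clipped PPO surrogate plus a KL penalty toward the reference policy.

    All tensors hold per-token values of shape (batch, seq_len).
    Illustrative only; the hyperparameters mirror the table above.
    """
    # Probability ratio between the current and the rollout (old) policy.
    ratio = torch.exp(log_probs - old_log_probs)
    # Clipped surrogate: ratio limited to [1 - clip_low, 1 + clip_high].
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages,
    )
    # Simple KL estimate that discourages drifting from the base (reference) model.
    kl = log_probs - ref_log_probs
    # PPO maximizes the surrogate, so the loss negates it and adds the penalty.
    return (-surrogate + kl_coef * kl).mean()
```
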
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LichengLiu03/qwen2.5-3b-ppo-metamath-full")
model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Example math problem
prompt = "Solve this math problem: If a circle has a radius of 5cm, what is its area?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

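Since the base model is instruction-tuned, wrapping the problem in the tokenizer's chat template matches the prompt format the model saw during training and usually gives better answers. Continuing from the `model` and `tokenizer` loaded above (the temperature of 0.5 mirrors the eval setting listed in the training config):

```python
messages = [
    {"role": "user", "content": "If a circle has a radius of 5 cm, what is its area?"}
]

# Render the conversation with the model's chat template before generating.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.5, do_sample=True)

# Decode only the newly generated tokens.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
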
## Features

This model is optimized for mathematical reasoning with PPO. Compared to the base model, it improves:

- ✅ Math problem understanding
- ✅ Logical reasoning accuracy
- ✅ Clarity of solution steps
- ✅ Calculation accuracy

## Technical Details

- **Tensor parallel training**: distributed across 2 GPUs
- **Memory optimization**: gradient checkpointing and mixed precision
- **Reward modeling**: based on MetaMathQA answer correctness and reasoning quality (a simplified sketch follows below)
- **Policy optimization**: PPO for stable training

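The reward function used during training is not shipped with this checkpoint. As a hypothetical illustration of the correctness part of such a reward (the actual reward also weighed reasoning quality and may look quite different), a rule-based check for MetaMathQA-style answers, which typically end with a line like "The answer is: 42", could be as simple as:

```python
import re

def correctness_reward(model_response: str, reference_answer: str) -> float:
    """Toy rule-based reward: 1.0 if the final answer matches the reference, else 0.0.

    Hypothetical sketch; the reward used for the actual training run may differ.
    """
    match = re.search(r"[Tt]he answer is:?\s*(.+)", model_response)
    if match is None:
        return 0.0
    prediction = match.group(1).strip().rstrip(".")
    return 1.0 if prediction == reference_answer.strip() else 0.0


# Hypothetical response/reference pair:
print(correctness_reward("Area = pi * r^2 = 25*pi. The answer is: 25*pi", "25*pi"))  # 1.0
```
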
## Limitations

- Mainly optimized for mathematical reasoning
- May not perform as well on general tasks
- Recommended for math, logic, and reasoning tasks

## License

This model is licensed under Apache 2.0.