---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- ppo
- rlhf
- metamath
- math
- reasoning
- verl
pipeline_tag: text-generation
---
# Qwen2.5-3B-UFO-1turn
This model is based on **Qwen2.5-3B-Instruct** and trained with **PPO (Proximal Policy Optimization)** on the **MetaMathQA** dataset for mathematical reasoning.
GitHub: https://github.com/lichengliu03/unary-feedback
Website: https://unary-feedback.github.io/
## Model Info
- **Base model**: Qwen/Qwen2.5-3B-Instruct
- **Training method**: PPO (full-parameter fine-tuning, not LoRA)
- **Training data**: MATH_MetaMathQA
- **Training steps**: 200
- **Framework**: VERL
- **Tensor parallel**: distributed training across 2 GPUs
- **Model size**: ~6 GB
## Training Config
- **Micro Batch Size**: 1 per GPU
- **PPO Mini Batch Size**: 8
- **Actor Learning Rate**: auto
- **Critic Learning Rate**: auto
- **KL Penalty**: 0.001
- **Clip Ratio**: 0.2-0.28 (see the sketch after this list)
- **Temperature**: 1.0 (train), 0.5 (eval)
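For reference, the clip ratio and KL penalty above enter the standard PPO objective roughly as sketched below. This is an illustrative re-implementation of the textbook clipped loss, not the actual VERL training code; the function name and tensor shapes are assumptions.

```python
import torch

def ppo_policy_loss(log_probs, old_log_probs, advantages, ref_log_probs,
                    clip_ratio=0.2, kl_coef=0.001):
    """Illustrative PPO clipped objective with a KL penalty toward the reference policy.

    All inputs are per-token tensors of shape (batch, seq_len). The defaults mirror
    the config above (clip ratio 0.2, KL penalty 0.001); this is NOT the VERL code.
    """
    # Probability ratio between the current policy and the rollout (old) policy
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate objective (maximized, so negated to form a loss)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # KL penalty keeps the policy close to the frozen reference (base) model.
    # Added to the loss here for simplicity; in practice it is often folded into the reward.
    kl = (log_probs - ref_log_probs).mean()
    return policy_loss + kl_coef * kl
```

The advantages come from a learned critic (value model), consistent with the separate actor and critic learning rates listed above.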
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LichengLiu03/qwen2.5-3b-ppo-metamath-full")
model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Example math problem
prompt = "Solve this math problem: If a circle has a radius of 5 cm, what is its area?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
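Because the base model is instruction-tuned, prompts formatted with the tokenizer's chat template may work better than raw text. A minimal sketch reusing the `tokenizer` and `model` objects above, with the eval temperature of 0.5 from the training config:

```python
# Optional: format the prompt with the Qwen2.5-Instruct chat template
messages = [
    {"role": "user", "content": "If a circle has a radius of 5 cm, what is its area?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.5, do_sample=True)

# Decode only the newly generated tokens so the prompt is not echoed back
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```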
## Features
This model is optimized for mathematical reasoning with PPO. Compared to the base model, it improves:
- ✅ Math problem understanding
- ✅ Logical reasoning accuracy
- ✅ Clarity of solution steps
- ✅ Calculation accuracy
## Technical Details
- **Tensor parallel training**: distributed across 2 GPUs
- **Memory optimization**: gradient checkpointing and mixed precision
- **Reward modeling**: based on MetaMathQA answer correctness and reasoning quality (see the sketch below)
- **Policy optimization**: PPO for stable training
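As a rough illustration of the correctness-based reward described above, a rule-based reward can compare the model's final answer against the gold answer. The extraction pattern and function below are assumptions for illustration, not the reward code used in training:

```python
import re

def correctness_reward(response: str, gold_answer: str) -> float:
    """Illustrative rule-based reward: 1.0 if the extracted answer matches the gold
    answer, else 0.0. The extraction heuristics are assumptions, not the trained setup.
    """
    # Prefer an explicitly boxed answer, otherwise fall back to the last number
    boxed = re.findall(r"\\boxed\{([^}]*)\}", response)
    if boxed:
        prediction = boxed[-1].strip()
    else:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        prediction = numbers[-1] if numbers else ""
    try:
        return 1.0 if abs(float(prediction) - float(gold_answer)) < 1e-6 else 0.0
    except ValueError:
        return 1.0 if prediction == gold_answer.strip() else 0.0
```

In training, a function of this kind is applied to each rollout to produce the scalar reward that PPO optimizes.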
## Limitations
- Mainly optimized for mathematical reasoning
- May not perform as well on general tasks
- Recommended for math, logic, and reasoning tasks
## License
This model is licensed under Apache 2.0.