---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- ppo
- rlhf
- metamath
- math
- reasoning
- verl
pipeline_tag: text-generation
---

# Qwen2.5-3B-UFO-1turn

This model is based on **Qwen2.5-3B-Instruct** and trained with **PPO (Proximal Policy Optimization)** on the **MetaMathQA** dataset for mathematical reasoning.

GitHub: https://github.com/lichengliu03/unary-feedback

Website: https://unary-feedback.github.io/

## Model Info

- **Base model**: Qwen/Qwen2.5-3B-Instruct
- **Training method**: PPO (full-parameter fine-tuning, not LoRA)
- **Training data**: MATH_MetaMathQA
- **Training steps**: 200
- **Framework**: VERL
- **Parallelism**: tensor-parallel training across 2 GPUs
- **Model size**: ~6 GB

## Training Config

- **Micro Batch Size**: 1 per GPU
- **PPO Mini Batch Size**: 8
- **Actor Learning Rate**: auto
- **Critic Learning Rate**: auto
- **KL Penalty**: 0.001
- **Clip Ratio**: 0.2–0.28
- **Temperature**: 1.0 (train), 0.5 (eval)

The clip ratio and KL penalty are the two standard PPO regularizers; a reference statement of the objective is given at the end of this card.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LichengLiu03/qwen2.5-3b-ppo-metamath-full")
model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Example math problem
prompt = "Solve this math problem: If a circle has a radius of 5 cm, what is its area?"
# Move inputs to the model's device (needed when loading with device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate an answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

A chat-formatted variant of this example is sketched at the end of this card.

## Features

This model is optimized for mathematical reasoning with PPO. Compared to the base model, it improves:

- ✅ Math problem understanding
- ✅ Logical reasoning accuracy
- ✅ Clarity of solution steps
- ✅ Calculation accuracy

## Technical Details

- **Tensor-parallel training**: distributed across 2 GPUs
- **Memory optimization**: gradient checkpointing and mixed-precision training
- **Reward modeling**: based on answer correctness and reasoning quality on MetaMathQA (an illustrative sketch of this pattern appears at the end of this card)
- **Policy optimization**: PPO for stable policy updates

## Limitations

- Optimized primarily for mathematical reasoning
- May not perform as well on general-purpose tasks
- Recommended for math, logic, and reasoning tasks

## License

This model is licensed under Apache 2.0.
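
## PPO Objective (Reference)

The Clip Ratio and KL Penalty listed under Training Config correspond to the two regularizers in the PPO objective. For reference, the standard clipped surrogate loss (a generic statement, not a VERL-specific formula) is

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
$$

where $\epsilon$ is the clip ratio (0.2–0.28 here) and $\hat{A}_t$ is the advantage estimate. The KL penalty adds a term $-\beta\,\mathrm{KL}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$ with $\beta = 0.001$ to the reward, discouraging the policy from drifting far from the reference model.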
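
## Chat-Format Inference

Since the base model is instruction-tuned, applying the Qwen chat template may produce better-structured answers than a raw prompt. The following is a minimal sketch using the standard `transformers` chat-template API; the generation settings (e.g., `temperature=0.5`, matching the eval temperature above) are illustrative, not prescribed by the training recipe.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "LichengLiu03/qwen2.5-3b-ppo-metamath-full"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Wrap the problem as a user turn so the instruct chat template is applied
messages = [
    {"role": "user", "content": "If a circle has a radius of 5 cm, what is its area?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_new_tokens=512,
        temperature=0.5,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (skip the prompt)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```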
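
## Reward Function Sketch

The card states that the reward is based on answer correctness on MetaMathQA; the exact implementation is not documented here. A common pattern for math RL training is a rule-based check that compares the final answer in a response against the reference. The helper below is hypothetical and purely illustrative of that pattern; see the GitHub repository for the reward actually used.

```python
import re

def correctness_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the last number in the
    response matches the reference answer, else 0.0. Illustrative only;
    not the reward implementation used for this model."""
    # Pull out all number-like tokens, ignoring thousands separators
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not matches:
        return 0.0
    # Treat the last number in the response as the model's final answer
    return 1.0 if matches[-1] == gold_answer.strip() else 0.0
```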