---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- qwen2.5
- ppo
- rlhf
- metamath
- math
- reasoning
- verl
pipeline_tag: text-generation
---
# Qwen2.5-3B-UFO-1turn
This model is based on **Qwen2.5-3B-Instruct** and trained with **PPO (Proximal Policy Optimization)** on the **MetaMathQA** dataset for mathematical reasoning.
GitHub: https://github.com/lichengliu03/unary-feedback
Website: https://unary-feedback.github.io/
## Model Info
- **Base model**: Qwen/Qwen2.5-3B-Instruct
- **Training method**: PPO (full-parameter fine-tuning, not LoRA)
- **Training data**: MATH_MetaMathQA
- **Training steps**: 200
- **Framework**: VERL
- **Tensor parallelism**: 2 GPUs (distributed training)
- **Model size**: ~6 GB
## Training Config
- **Micro Batch Size**: 1 per GPU
- **PPO Mini Batch Size**: 8
- **Actor Learning Rate**: auto
- **Critic Learning Rate**: auto
- **KL Penalty**: 0.001
- **Clip Ratio**: 0.2-0.28
- **Temperature**: 1.0 (train), 0.5 (eval); see the sketch after this list
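As a rough illustration of how the train vs. eval sampling temperatures above could be applied at inference time (this is an assumption for demonstration, not a config exported from the training run; the names `train_sampling` and `eval_sampling` are illustrative), they can be written as `transformers` `GenerationConfig` objects:
```python
from transformers import GenerationConfig

# Illustrative mapping of the card's sampling settings to inference-time configs
train_sampling = GenerationConfig(do_sample=True, temperature=1.0, max_new_tokens=512)
eval_sampling = GenerationConfig(do_sample=True, temperature=0.5, max_new_tokens=512)

# Example: pass one of them to generate (model/tokenizer loaded as in the Usage section below)
# outputs = model.generate(**inputs, generation_config=eval_sampling)
```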
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LichengLiu03/qwen2.5-3b-ppo-metamath-full")
model = AutoModelForCausalLM.from_pretrained(
    "LichengLiu03/qwen2.5-3b-ppo-metamath-full",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Example math problem
prompt = "Solve this math problem: If a circle has a radius of 5 cm, what is its area?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate an answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
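Since the base model is an instruct model, prompts formatted with the tokenizer's chat template may work better than raw text. The sketch below reuses the `model` and `tokenizer` loaded above and assumes the tokenizer ships with Qwen's default chat template:
```python
messages = [
    {"role": "user", "content": "If a circle has a radius of 5 cm, what is its area?"}
]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.5,  # evaluation temperature listed in the training config
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```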
## Features
This model is optimized for mathematical reasoning with PPO. Compared to the base model, it improves:
- βœ… Math problem understanding
- βœ… Logical reasoning accuracy
- βœ… Clarity of solution steps
- βœ… Calculation accuracy
## Technical Details
- **Tensor-parallel training**: distributed across 2 GPUs
- **Memory optimization**: gradient checkpointing and mixed precision
- **Reward modeling**: based on MetaMathQA correctness and reasoning quality (a minimal sketch follows this list)
- **Policy optimization**: PPO for stable training
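The exact reward function is not published here; the following is a minimal sketch of a correctness-style reward, assuming the final answer is extracted from a "The answer is:" span and compared by exact string match. The function name `correctness_reward` and the extraction pattern are illustrative assumptions, not the training code:
```python
import re

def correctness_reward(response: str, ground_truth: str) -> float:
    """Illustrative sketch: 1.0 if the extracted final answer matches, else 0.0."""
    match = re.search(r"The answer is:?\s*(.+)", response)
    prediction = match.group(1).strip() if match else response.strip()
    return 1.0 if prediction == ground_truth.strip() else 0.0

# Example usage with a hypothetical model response:
print(correctness_reward("... so the area is 25*pi. The answer is: 25\\pi", "25\\pi"))  # 1.0
```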
## Limitations
- Mainly optimized for mathematical reasoning
- May not perform as well on general tasks
- Recommended for math, logic, and reasoning tasks
## License
This model is licensed under Apache 2.0.