Model Card for Qwen2.5-0.5B Fine-Tuned on Enhanced GSM8K
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B designed to solve math word problems with improved reasoning structure and response format. Fine-tuning was performed with LoRA (Low-Rank Adaptation) for parameter-efficient training and the GRPO (Group Relative Policy Optimization) algorithm to optimize outputs against reward signals.
Model Objective
The goal of fine-tuning is to improve performance on short-form math reasoning tasks such as those in the GSM8K benchmark. The model is trained to produce outputs in a specific format:
<think>...reasoning steps...</think><answer>...final answer...</answer>
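For illustration, a well-formed completion might look like the following (the problem and numbers here are purely illustrative and not drawn from the training data):

```
<think>Each pen costs $2 and Sam buys 6 pens, so the total cost is 6 * 2 = 12 dollars.</think><answer>12</answer>
```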
Dataset
Training was done on:
eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
This version enhances the original GSM8K dataset with structured reasoning chains and answer tokens for alignment and reward-based learning.
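As a minimal sketch (assuming the standard Hugging Face datasets API; check the dataset card for the exact split names), the data can be loaded like this:

```python
from datasets import load_dataset

# Load the enhanced GSM8K dataset from the Hugging Face Hub.
dataset = load_dataset(
    "eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1"
)

print(dataset)               # inspect the available splits
print(dataset["train"][0])   # assumption: a "train" split exists; look at one example
```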
Training Methodology
LoRA Configuration
The model was fine-tuned with LoRA on the attention projection layers (q_proj, v_proj) for parameter-efficient adaptation:
```python
from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,          # causal language modeling
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],   # attention query/value projections
    lora_dropout=0.1,
    bias="none",
)
model = get_peft_model(model, lora_config)
```
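For context, the base model referenced above could be loaded as follows before applying the LoRA adapter; the exact loading arguments used for training are not stated in this card, so treat these as assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "Qwen/Qwen2.5-0.5B"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,  # assumption: matches the bf16 training setting below
)

# `model` is then wrapped with the LoRA configuration shown above:
# model = get_peft_model(model, lora_config)
```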
Reward Functions for GRPO
We used two custom reward functions to guide generation during training:
reward_len: encourages completions whose length is close to 50, keeping the reasoning concise. As written, len(completion) measures the completion string, so this is effectively a character budget rather than a token count.
```python
def reward_len(completions, **kwargs):
    # Penalize each completion in proportion to how far its length is from 50.
    return [-abs(50 - len(completion)) for completion in completions]
```
reward_format: enforces the <think>...</think><answer>...</answer> output format.
```python
import re

def reward_format(completions, **kwargs):
    # Reward 1.0 only when the completion matches the expected structure end to end.
    # Note: `.` does not match newlines here, so the check assumes single-line completions.
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c) else 0.0 for c in completions]
```
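A quick sanity check of how these two rewards score a well-formed versus a malformed completion (the sample strings below are hypothetical):

```python
samples = [
    "<think>6 pens at $2 each cost 6 * 2 = 12 dollars.</think><answer>12</answer>",
    "The answer is 12.",  # missing the required tags
]

print(reward_format(samples))  # [1.0, 0.0]
print(reward_len(samples))     # negative values; smaller magnitude means closer to length 50
```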
GRPO Training Configuration
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_prompt_length=512,        # truncate prompts to 512 tokens
    max_completion_length=96,     # cap generated completions at 96 tokens
    num_generations=8,            # candidate completions sampled per prompt
    optim="adamw_8bit",
    num_train_epochs=1,
    bf16=True,
    report_to="none",
    remove_unused_columns=False,
    logging_steps=1,
)
```
- Mixed precision training: enabled (bf16=True)
- Optimizer: adamw_8bit for memory-efficient optimization
- Sampling: 8 candidate generations per prompt (num_generations=8)
- Reward-guided selection: candidates are scored with reward_len and reward_format before backpropagation (see the sketch after this list)
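Putting the pieces together, a minimal sketch of the training loop with TRL's GRPOTrainer; the dataset wiring is an assumption, since the card does not spell out the preprocessing:

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,                               # the LoRA-wrapped Qwen2.5-0.5B from above
    reward_funcs=[reward_len, reward_format],  # both rewards score every candidate completion
    args=training_args,
    train_dataset=dataset["train"],            # assumption: a train split with a "prompt" column
)
trainer.train()
```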
Use Cases
- Educational tutoring systems for math reasoning
- Automated math assistants and solvers
- Research on structured reasoning and format-aware generation
Limitations
This model is fine-tuned on a specialized subset of math problems and may not generalize to other reasoning tasks. It expects prompts aligned with math word problem structure and outputs in a strict reasoning/answer format.
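Given these format expectations, a minimal inference sketch is shown below, assuming the LoRA adapter is available on the Hub under the repository id listed in the model tree; the prompt wording and generation settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-0.5B"
adapter_id = "eagle0504/fine-tuned-Qwen2.5-0.5B-openai-gsm8k-enhanced-v1"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_id)  # attach the LoRA adapter

prompt = "Sam buys 6 pens at $2 each. How much does he spend in total?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=96)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```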
License
This model is released under the MIT License.
Model tree for eagle0504/fine-tuned-Qwen2.5-0.5B-openai-gsm8k-enhanced-v1
- Base model: Qwen/Qwen2.5-0.5B