Model Card for Qwen2.5-0.5B Fine-Tuned on Enhanced GSM8K

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B designed to solve math word problems with an improved reasoning structure and response format. Fine-tuning combined LoRA (Low-Rank Adaptation) for parameter-efficient training with the GRPO (Group Relative Policy Optimization) algorithm, which optimizes outputs using reward-driven signals.


🧠 Model Objective

The goal of this fine-tune is to improve performance on short-form math reasoning tasks, such as those found in the GSM8K benchmark. The model has been trained to produce outputs in a specific format:

<think>...reasoning steps...</think><answer>...final answer...</answer>
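For reference, here is a minimal inference sketch; the prompt wording, generation settings, and answer-extraction step are illustrative assumptions rather than part of the training setup:

# Load the fine-tuned model and generate a tagged answer for one problem.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "eagle0504/fine-tuned-Qwen2.5-0.5B-openai-gsm8k-enhanced-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "Solve the problem and respond as <think>...</think><answer>...</answer>.\n"
    "Natalia sold clips to 48 of her friends in April, and then she sold half "
    "as many clips in May. How many clips did she sell altogether?"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=96)
completion = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Pull out the final answer if the expected tag structure is present.
match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
print(match.group(1).strip() if match else completion)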

πŸ“š Dataset

Training was performed on an enhanced variant of the OpenAI GSM8K dataset of grade-school math word problems. A sketch of how such a dataset can be turned into GRPO prompts is shown below.
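As an illustration only (the exact enhanced dataset is not named here, so the standard openai/gsm8k release with question/answer columns stands in), prompts can be prepared like this:

# Map GSM8K-style examples to a "prompt" column, which TRL's GRPOTrainer expects.
from datasets import load_dataset

FORMAT_NOTE = (
    "Respond in the format <think>...reasoning...</think><answer>...final answer...</answer>."
)

def to_prompt(example):
    return {"prompt": f"{FORMAT_NOTE}\n\n{example['question']}"}

train_dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_prompt)
print(train_dataset[0]["prompt"])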


πŸ› οΈ Training Methodology

πŸ”„ LoRA Configuration

The model was fine-tuned using LoRA on attention projection layers (q_proj, v_proj) for parameter-efficient adaptation:

from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

# Load the base model to be adapted.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.1,
    bias="none",
)

model = get_peft_model(model, lora_config)
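Only the low-rank adapter weights injected into q_proj and v_proj are trainable; the base model stays frozen. A quick check with PEFT's built-in helper:

# Prints the number of trainable (LoRA) parameters versus total parameters.
model.print_trainable_parameters()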

🎯 Reward Functions for GRPO

We used two custom reward functions to guide generation during training:

  • reward_len: Encourages completions whose length (as measured by len(completion)) is close to a target of 50, discouraging both truncated and rambling reasoning.

def reward_len(completions, **kwargs):
    # Negative absolute distance from the target length of 50 (higher is better).
    return [-abs(50 - len(completion)) for completion in completions]

  • reward_format: Rewards completions that exactly follow the <think>...</think><answer>...</answer> structure.

import re

def reward_format(completions, **kwargs):
    # 1.0 if the whole completion matches the expected tag structure, else 0.0.
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c) else 0.0 for c in completions]
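A quick sanity check on hand-written strings (illustrative, not model output) shows how the two rewards behave:

samples = [
    "<think>48 in April, 24 in May, 48 + 24 = 72.</think><answer>72</answer>",
    "The answer is 72.",
]
print(reward_len(samples))     # negative distance of each length from the 50 target
print(reward_format(samples))  # [1.0, 0.0]: only the first sample matches the tags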

βš™οΈ GRPO Training Configuration

from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_prompt_length=512,
    max_completion_length=96,
    num_generations=8,
    optim="adamw_8bit",
    num_train_epochs=1,
    bf16=True,
    report_to="none",
    remove_unused_columns=False,
    logging_steps=1,
)
  • Mixed precision training: enabled (bf16=True)
  • Optimizer: adamw_8bit for memory-efficient optimization
  • Sampling: 8 candidate completions per prompt (num_generations=8)
  • Reward-guided updates: each candidate is scored with reward_len and reward_format, and the group-relative advantages computed from those scores drive the policy update (see the trainer sketch below).
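Putting the pieces together, a minimal training sketch with TRL's GRPOTrainer, assuming the model, reward functions, GRPOConfig, and train_dataset defined above:

from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,                               # LoRA-wrapped Qwen2.5-0.5B
    args=training_args,                        # GRPOConfig from above
    reward_funcs=[reward_len, reward_format],  # per-completion rewards are combined
    train_dataset=train_dataset,               # needs a "prompt" column
)
trainer.train()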

πŸ’‘ Use Cases

  • Educational tutoring systems for math reasoning
  • Automated math assistants and solvers
  • Research on structured reasoning and format-aware generation

πŸ§ͺ Limitations

This model is fine-tuned on a specialized subset of math problems and may not generalize to other reasoning tasks. It expects prompts structured as math word problems and produces outputs in a strict reasoning/answer format.


πŸ“œ License

This model is released under the MIT License.
