---
library_name: transformers
license: mit
datasets:
- eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B
---
# Model Card for `Qwen2.5-0.5B` Fine-Tuned on Enhanced GSM8K
This model is a fine-tuned version of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) designed to solve math word problems with improved reasoning structure and response format. Fine-tuning was performed with **LoRA (Low-Rank Adaptation)** for parameter-efficient training and the **GRPO (Group Relative Policy Optimization)** algorithm, which optimizes outputs against reward-driven signals.
---
## 🧠 Model Objective
The goal of this fine-tune is to improve short-form math reasoning on tasks such as those in the GSM8K benchmark. During training, the model is rewarded for producing outputs in a specific format:
```text
<think>...reasoning steps...</think><answer>...final answer...</answer>
```
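For example, the fine-tuned checkpoint can be queried like any causal LM. A minimal sketch (the model id below is a placeholder and the sample question is illustrative):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/qwen2.5-0.5b-gsm8k-grpo"  # placeholder: substitute the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "A baker made 48 cookies and sold half of them. How many cookies are left?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=96)  # matches max_completion_length used in training
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```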
---
## 📊 Dataset
Training was done on:
- [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)
This version enhances the original GSM8K dataset with structured reasoning chains and answer tokens for alignment and reward-based learning.
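To inspect the data, it can be loaded directly with the `datasets` library:
```python
from datasets import load_dataset

# Load the enhanced GSM8K dataset used for fine-tuning
dataset = load_dataset(
    "eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1"
)

print(dataset)              # split names and sizes
print(dataset["train"][0])  # inspect one training example
```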
---
## 🛠️ Training Methodology
### 🔁 LoRA Configuration
The model was fine-tuned using LoRA on attention projection layers (`q_proj`, `v_proj`) for parameter-efficient adaptation:
```python
from peft import get_peft_model, LoraConfig, TaskType
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.1,
bias="none",
)
model = get_peft_model(model, lora_config)
```
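As a sanity check, PEFT's `print_trainable_parameters()` reports how small the trainable fraction is. A minimal sketch, assuming the base model is loaded fresh from the Hub:
```python
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
peft_model = get_peft_model(base, lora_config)

# Prints something like: trainable params: ... || all params: ... || trainable%: ...
peft_model.print_trainable_parameters()
```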
---
### 🎯 Reward Functions for GRPO
We used two custom reward functions to guide generation during training:
- `reward_len`: Encourages concise generations close to 50 characters in length (the reward is computed on string length, not token count), penalizing the absolute deviation:
```python
def reward_len(completions, **kwargs):
    # Reward peaks at 0 for a 50-character completion and
    # decreases linearly with the absolute deviation from 50.
    return [-abs(50 - len(completion)) for completion in completions]
```
- `reward_format`: Enforces the `<think>...</think><answer>...</answer>` output structure:
```python
import re

def reward_format(completions, **kwargs):
    # Full reward only when the completion follows the expected
    # <think>...</think><answer>...</answer> structure end to end;
    # re.DOTALL lets the reasoning span multiple lines.
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]
```
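To illustrate, applying both rewards to two hand-written completions (the sample strings are illustrative only):
```python
samples = [
    "<think>3 + 4 = 7</think><answer>7</answer>",  # well-formatted, 42 characters
    "The answer is 7.",                            # missing the required tags, 16 characters
]

print(reward_len(samples))     # [-8, -34]  (distance from the 50-character target)
print(reward_format(samples))  # [1.0, 0.0]
```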
---
### ⚙️ GRPO Training Configuration
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_prompt_length=512,        # truncate prompts beyond 512 tokens
    max_completion_length=96,     # cap generated completions at 96 tokens
    num_generations=8,            # candidate completions sampled per prompt
    optim="adamw_8bit",           # 8-bit AdamW for memory efficiency
    num_train_epochs=1,
    bf16=True,                    # bfloat16 mixed precision
    report_to="none",
    remove_unused_columns=False,  # keep extra dataset columns available to reward functions
    logging_steps=1,
)
```
- **Mixed precision training**: Enabled (`bf16=True`)
- **Optimizer**: `adamw_8bit` for memory-efficient optimization
- **Sampling**: 8 candidate generations per prompt (`num_generations=8`)
- **Reward-guided optimization**: Each candidate is scored with `reward_len` and `reward_format`, and the group-relative scores drive the policy update (see the trainer sketch below).
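Putting the pieces together, training runs through TRL's `GRPOTrainer`. A minimal sketch; the `dataset` variable and split name are assumptions carried over from the loading example above:
```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,                               # the LoRA-wrapped Qwen2.5-0.5B
    args=training_args,
    reward_funcs=[reward_len, reward_format],  # rewards are combined per candidate
    train_dataset=dataset["train"],            # must expose a "prompt" column
)
trainer.train()
```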
---
## 💡 Use Cases
- Educational tutoring systems for math reasoning
- Automated math assistants and solvers
- Research on structured reasoning and format-aware generation
---
## 🧪 Limitations
This model is fine-tuned on a specialized set of math word problems and may not generalize to other reasoning tasks. It expects prompts phrased as math word problems and produces outputs in the strict reasoning/answer format described above.
---
## 📄 License
This model is released under the **MIT License**.