---
library_name: transformers
license: mit
datasets:
- eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B
---

# Model Card for `Qwen2.5-0.5B` Fine-Tuned on Enhanced GSM8K

This model is a fine-tuned version of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) designed to solve math word problems with an improved reasoning structure and response format. Fine-tuning was performed with **LoRA (Low-Rank Adaptation)** for parameter-efficient training and the **GRPO (Group Relative Policy Optimization)** algorithm to optimize outputs against reward signals.

---

## 🧠 Model Objective

The goal of this fine-tuning is to improve short-form math reasoning of the kind found in the GSM8K benchmark. The reward setup encourages the model to produce outputs in a specific format:

```text
<think>...reasoning steps...</think><answer>...final answer...</answer>
```
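For illustration, here is a minimal inference sketch using the standard `transformers` API. The repository id is a hypothetical placeholder, and the example prompt and decoding settings are assumptions rather than recommended values:

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id for the fine-tuned (or adapter-merged) weights
model_id = "your-username/qwen2.5-0.5b-gsm8k-grpo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=96)

# Decode only the newly generated tokens, then pull out the <answer> span if present
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
print(match.group(1).strip() if match else completion)
```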

---

## πŸ“š Dataset

Training was done on:

- [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)  
This version enhances the original GSM8K dataset with structured reasoning chains and answer tokens for alignment and reward-based learning.
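For orientation, here is a minimal sketch of loading the dataset with the `datasets` library; the split names are inferred from the dataset name and should be verified against the dataset card:

```python
from datasets import load_dataset

# Enhanced GSM8K with structured reasoning chains (8k train / 1k test per the name)
ds = load_dataset(
    "eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1"
)

print(ds)              # available splits and their sizes
print(ds["train"][0])  # inspect one enhanced example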

---

## πŸ› οΈ Training Methodology

### πŸ”„ LoRA Configuration

The model was fine-tuned using LoRA on attention projection layers (`q_proj`, `v_proj`) for parameter-efficient adaptation:

```python
from peft import get_peft_model, LoraConfig, TaskType

# LoRA is applied only to the attention query/value projections
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # low-rank dimension of the adapters
    lora_alpha=16,                        # scaling factor for the adapter updates
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
)

# Wrap the base model so that only the adapter weights are trained
model = get_peft_model(model, lora_config)
```
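The snippet above assumes `model` already holds the base checkpoint. Below is a hedged sketch of loading the base model with standard `transformers` calls and checking how small the trainable footprint is:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model

# Load the base checkpoint that the LoRA adapter attaches to
base_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

# Only the adapter weights on q_proj/v_proj are trainable after wrapping
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```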

---

### 🎯 Reward Functions for GRPO

We used two custom reward functions to guide generation during training:

- `reward_len`: Encourages completions whose length (as measured by `len()`) is close to 50, penalizing outputs that are much shorter or longer.

```python
def reward_len(completions, **kwargs):
    return [-abs(50 - len(completion)) for completion in completions]
```

- `reward_format`: Rewards completions that match the `<think>...</think><answer>...</answer>` template exactly.

```python
import re

def reward_format(completions, **kwargs):
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c) else 0.0 for c in completions]
```
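A quick sanity check on hypothetical completions (not actual training outputs) shows how the two rewards interact:

```python
good = "<think>48 + 24 = 72</think><answer>72</answer>"
bad = "The answer is 72."

print(reward_len([good, bad]))     # completions closer to 50 characters score higher
print(reward_format([good, bad]))  # [1.0, 0.0] -- only the first matches the template
```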

---

### βš™οΈ GRPO Training Configuration

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_prompt_length=512,           # maximum prompt length in tokens
    max_completion_length=96,        # tokens generated per candidate
    num_generations=8,               # candidates sampled per prompt
    optim="adamw_8bit",
    num_train_epochs=1,
    bf16=True,
    report_to="none",
    remove_unused_columns=False,     # keep extra dataset columns for the reward functions
    logging_steps=1,
)
```

- **Mixed precision training**: Enabled (`bf16=True`)
- **Optimizer**: `adamw_8bit` for memory-efficient optimization
- **Sampling**: 8 candidate generations per prompt (`num_generations=8`)
- **Reward-guided updates**: Each group of candidates is scored with `reward_len` and `reward_format`, and the group-relative advantages drive the policy update (a minimal end-to-end sketch follows this list).
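
Putting the pieces together, the following is a minimal end-to-end sketch, not the exact training script. It assumes TRL's `GRPOTrainer`, reuses the dataset, reward functions, LoRA-wrapped model, and `training_args` from above, and glosses over prompt/column preparation:

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,                               # LoRA-wrapped Qwen2.5-0.5B
    args=training_args,                        # GRPOConfig defined above
    train_dataset=ds["train"],                 # enhanced GSM8K training split
    reward_funcs=[reward_len, reward_format],  # per-candidate rewards are combined
)
trainer.train()
```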

---

## πŸ’‘ Use Cases

- Educational tutoring systems for math reasoning
- Automated math assistants and solvers
- Research on structured reasoning and format-aware generation

---

## πŸ§ͺ Limitations

This model is fine-tuned on a specialized set of math word problems and may not generalize to other reasoning tasks. It expects prompts structured like math word problems and produces outputs in a strict reasoning/answer format.

---

## πŸ“œ License

This model is released under the **MIT License**.