---
library_name: transformers
license: mit
datasets:
- eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B
---
# Model Card for `Qwen2.5-0.5B` Fine-Tuned on Enhanced GSM8K
This model is a fine-tuned version of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) designed to solve math word problems with a more structured reasoning process and response format. Fine-tuning used **LoRA (Low-Rank Adaptation)** for parameter-efficient training and the **GRPO (Group Relative Policy Optimization)** algorithm to optimize outputs against reward signals.
---
## Model Objective
The goal of this fine-tuning is to improve performance on short-form math reasoning tasks such as those in the GSM8K benchmark. The model is trained to produce outputs in a specific format:
```text
<think>...reasoning steps...</think><answer>...final answer...</answer>
```
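A minimal inference sketch, assuming this repository's model id is substituted for the hypothetical `MODEL_ID` placeholder (generation settings here are illustrative, not the ones used in training):
```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "..."  # placeholder: replace with this repository's model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=96)
text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Pull the final answer out of the <answer>...</answer> span, if present.
match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
print(match.group(1) if match else text)
```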
---
## Dataset
Training was done on:
- [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)
This version enhances the original GSM8K dataset with structured reasoning chains and answer tokens for alignment and reward-based learning.
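A quick way to inspect the splits and fields before training; the column names are printed rather than assumed, since they are not documented in this card:
```python
from datasets import load_dataset

ds = load_dataset(
    "eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1"
)
print(ds)                        # split names and sizes (train/test)
print(ds["train"].column_names)  # available fields
print(ds["train"][0])            # one enhanced example with reasoning and answer tokens
```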
---
## Training Methodology
### LoRA Configuration
The model was fine-tuned using LoRA on attention projection layers (`q_proj`, `v_proj`) for parameter-efficient adaptation:
```python
from peft import LoraConfig, TaskType, get_peft_model

# `model` is the base Qwen/Qwen2.5-0.5B causal LM, loaded beforehand
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,         # causal language modeling task
    r=8,                                  # low-rank dimension of the adapter matrices
    lora_alpha=16,                        # scaling factor applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.1,                     # dropout on the adapter inputs
    bias="none",                          # leave bias terms frozen
)
model = get_peft_model(model, lora_config)
```
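For context, a sketch of how the base model might be loaded before the adapter is applied; the exact loading options used during training are not specified in this card:
```python
from peft import get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

model = get_peft_model(model, lora_config)  # lora_config from the snippet above
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```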
---
### Reward Functions for GRPO
We used two custom reward functions to guide generation during training:
- `reward_len`: Encourages generations close to 50 characters (Python string length, not tokens), nudging the model toward concise reasoning.
```python
def reward_len(completions, **kwargs):
    # Negative absolute distance from the 50-character target:
    # 0 at exactly 50 characters, increasingly negative further away.
    return [-abs(50 - len(completion)) for completion in completions]
```
- `reward_format`: Enforces the `<think>...</think><answer>...</answer>` output format.
```python
import re

def reward_format(completions, **kwargs):
    # Full-string match: reasoning inside <think>, result inside <answer>.
    # Note: without re.DOTALL, `.` does not match newlines, so a completion
    # only scores 1.0 when the tags sit on a single line.
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c) else 0.0 for c in completions]
```
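To illustrate how the two rewards behave together, a small check on a well-formatted and a malformed completion (the values follow directly from the definitions above):
```python
good = "<think>48 + 24 = 72</think><answer>72</answer>"
bad = "The answer is 72."

print(reward_len([good, bad]))     # -|50 - len(c)| per string: here [-4, -33]
print(reward_format([good, bad]))  # [1.0, 0.0]
```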
---
### GRPO Training Configuration
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_prompt_length=512,        # prompts truncated to 512 tokens
    max_completion_length=96,     # completions capped at 96 new tokens
    num_generations=8,            # candidates sampled per prompt
    optim="adamw_8bit",           # 8-bit AdamW for lower optimizer memory
    num_train_epochs=1,
    bf16=True,                    # bfloat16 mixed precision
    report_to="none",
    remove_unused_columns=False,  # keep extra columns for the reward functions
    logging_steps=1,
)
```
- **Mixed precision training**: Enabled (`bf16=True`)
- **Optimizer**: `adamw_8bit` for memory-efficient optimization
- **Sampling**: 8 candidate generations per prompt (`num_generations=8`)
- **Reward-guided updates**: Candidates are scored with `reward_len` and `reward_format`; GRPO turns these scores into group-relative advantages that drive the policy update (see the trainer sketch below).
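A sketch of how these pieces might be wired together with `trl`'s `GRPOTrainer`; the exact trainer invocation is not shown in this card, and `train_dataset` stands in for the enhanced GSM8K train split:
```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,                               # the LoRA-wrapped model from above
    args=training_args,
    reward_funcs=[reward_len, reward_format],  # rewards are combined per completion
    train_dataset=train_dataset,               # enhanced GSM8K train split
)
trainer.train()
```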
---
## Use Cases
- Educational tutoring systems for math reasoning
- Automated math assistants and solvers
- Research on structured reasoning and format-aware generation
---
## Limitations
This model is fine-tuned on a specialized set of math problems and may not generalize to other reasoning tasks. It expects prompts structured as math word problems and produces outputs in the strict reasoning/answer format described above.
---
## License
This model is released under the **MIT License**.