eagle0504/fine-tuned-Qwen2.5-0.5B-openai-gsm8k-enhanced-v1

Improve language tag

by lbourdois - opened Apr 27

base: refs/heads/main ← from: refs/pr/1

README.md +138 -126

@@ -1,126 +1,138 @@
----
-library_name: transformers
-license: mit
-datasets:
-- eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
-language:
-- en
-base_model:
-- Qwen/Qwen2.5-0.5B
----
-# Model Card for `Qwen2.5-0.5B` Fine-Tuned on Enhanced GSM8K
-This model is a fine-tuned version of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) designed to solve math word problems with improved reasoning structure and response format. Fine-tuning was performed using **LoRA (Low-Rank Adaptation)** for parameter-efficient training and the **GRPO (Guided Reward Preference Optimization)** algorithm to optimize outputs based on reward-driven signals.
----
-## 🧠 Model Objective
-The goal of this fine-tuned model is to enhance short-form math reasoning tasks such as those found in the GSM8K benchmark. The model has been encouraged to produce outputs with a specific format:
-```text
-<think>...reasoning steps...</think><answer>...final answer...</answer>
-```
----
-## 📚 Dataset
-Training was done on:
-- [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)
-This version enhances the original GSM8K dataset with structured reasoning chains and answer tokens for alignment and reward-based learning.
----
-## 🛠️ Training Methodology
-### 🔄 LoRA Configuration
-The model was fine-tuned using LoRA on attention projection layers (`q_proj`, `v_proj`) for parameter-efficient adaptation:
-```python
-from peft import get_peft_model, LoraConfig, TaskType
-lora_config = LoraConfig(
-    task_type=TaskType.CAUSAL_LM,
-    r=8,
-    lora_alpha=16,
-    target_modules=["q_proj", "v_proj"],
-    lora_dropout=0.1,
-    bias="none",
-)
-model = get_peft_model(model, lora_config)
-```
----
-### 🎯 Reward Functions for GRPO
-We used two custom reward functions to guide generation during training:
-- `reward_len`: Encourages generations close to 50 tokens for optimal reasoning length.
-```python
-def reward_len(completions, **kwargs):
-    return [-abs(50 - len(completion)) for completion in completions]
-```
-- `reward_format`: Enforces formatting in the style `<think>...</think><answer>...</answer>`
-```python
-import re
-def reward_format(completions, **kwargs):
-    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
-    return [1.0 if re.match(pattern, c) else 0.0 for c in completions]
-```
----
-### ⚙️ GRPO Training Configuration
-```python
-training_args = GRPOConfig(
-    output_dir="GRPO",
-    learning_rate=2e-5,
-    per_device_train_batch_size=8,
-    gradient_accumulation_steps=1,
-    max_prompt_length=512,
-    max_completion_length=96,
-    num_generations=8,
-    optim="adamw_8bit",
-    num_train_epochs=1,
-    bf16=True,
-    report_to="none",
-    remove_unused_columns=False,
-    logging_steps=1,
-)
-```
-- **Mixed precision training**: Enabled (`bf16=True`)
-- **Optimizer**: `adamw_8bit` for memory-efficient optimization
-- **Sampling**: 8 candidate generations per prompt (`num_generations=8`)
-- **Reward-guided selection**: Candidates are scored using `reward_len` and `reward_format` before backpropagation.
----
-## 💡 Use Cases
-- Educational tutoring systems for math reasoning
-- Automated math assistants and solvers
-- Research on structured reasoning and format-aware generation
----
-## 🧪 Limitations
-This model is fine-tuned on a specialized subset of math problems and may not generalize to other reasoning tasks. It expects prompts aligned with math word problem structure and outputs in a strict reasoning/answer format.
----
-## 📜 License
-This model is released under the **MIT License**.

+---
+library_name: transformers
+license: mit
+datasets:
+- eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
+language:
+- zho
+- eng
+- fra
+- spa
+- por
+- deu
+- ita
+- rus
+- jpn
+- kor
+- vie
+- tha
+- ara
+base_model:
+- Qwen/Qwen2.5-0.5B
+---
+# Model Card for `Qwen2.5-0.5B` Fine-Tuned on Enhanced GSM8K
+This model is a fine-tuned version of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) designed to solve math word problems with improved reasoning structure and response format. Fine-tuning was performed using **LoRA (Low-Rank Adaptation)** for parameter-efficient training and the **GRPO (Group Relative Policy Optimization)** algorithm to optimize outputs based on reward-driven signals.
+---
+## 🧠 Model Objective
+The goal of this fine-tuned model is to enhance short-form math reasoning tasks such as those found in the GSM8K benchmark. The model has been encouraged to produce outputs with a specific format:
+```text
+<think>...reasoning steps...</think><answer>...final answer...</answer>
+```
+---
+## 📚 Dataset
+Training was done on:
+- [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)
+This version enhances the original GSM8K dataset with structured reasoning chains and answer tokens for alignment and reward-based learning.
+---
+## 🛠️ Training Methodology
+### 🔄 LoRA Configuration
+The model was fine-tuned using LoRA on attention projection layers (`q_proj`, `v_proj`) for parameter-efficient adaptation:
+```python
+from peft import get_peft_model, LoraConfig, TaskType
+lora_config = LoraConfig(
+    task_type=TaskType.CAUSAL_LM,
+    r=8,
+    lora_alpha=16,
+    target_modules=["q_proj", "v_proj"],
+    lora_dropout=0.1,
+    bias="none",
+)
+model = get_peft_model(model, lora_config)
+```
+---
+### 🎯 Reward Functions for GRPO
+We used two custom reward functions to guide generation during training:
+- `reward_len`: Encourages generations close to 50 characters in length (`len(completion)` counts characters, not tokens, when completions are plain strings).
+```python
+def reward_len(completions, **kwargs):
+    return [-abs(50 - len(completion)) for completion in completions]
+```
+- `reward_format`: Enforces formatting in the style `<think>...</think><answer>...</answer>`
+```python
+import re
+def reward_format(completions, **kwargs):
+    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
+    return [1.0 if re.match(pattern, c) else 0.0 for c in completions]
+```
+---
+### ⚙️ GRPO Training Configuration
+```python
+training_args = GRPOConfig(
+    output_dir="GRPO",
+    learning_rate=2e-5,
+    per_device_train_batch_size=8,
+    gradient_accumulation_steps=1,
+    max_prompt_length=512,
+    max_completion_length=96,
+    num_generations=8,
+    optim="adamw_8bit",
+    num_train_epochs=1,
+    bf16=True,
+    report_to="none",
+    remove_unused_columns=False,
+    logging_steps=1,
+)
+```
+- **Mixed precision training**: Enabled (`bf16=True`)
+- **Optimizer**: `adamw_8bit` for memory-efficient optimization
+- **Sampling**: 8 candidate generations per prompt (`num_generations=8`)
+- **Reward-guided selection**: Candidates are scored using `reward_len` and `reward_format` before backpropagation.
+---
+## 💡 Use Cases
+- Educational tutoring systems for math reasoning
+- Automated math assistants and solvers
+- Research on structured reasoning and format-aware generation
+---
+## 🧪 Limitations
+This model is fine-tuned on a specialized subset of math problems and may not generalize to other reasoning tasks. It expects prompts aligned with math word problem structure and outputs in a strict reasoning/answer format.
+---
+## 📜 License
+This model is released under the **MIT License**.
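
The two reward functions in the README can be exercised outside of training to see what they actually score. The block below is a minimal sketch, not the author's training code: it assumes completions arrive as plain strings, and the `target_len` and `dotall` parameters and the two sample strings are hypothetical additions for illustration. It shows that `len(completion)` measures characters, and that the README's pattern, used without `re.DOTALL`, rejects reasoning that spans more than one line.

```python
import re

# Pattern copied from the README's reward_format.
FORMAT_PATTERN = r"^<think>.*?</think><answer>.*?</answer>$"


def reward_len(completions, target_len=50, **kwargs):
    """Negative distance from target_len, measured in characters.

    target_len is a hypothetical parameter; the README hard-codes 50.
    """
    return [-abs(target_len - len(c)) for c in completions]


def reward_format(completions, dotall=False, **kwargs):
    """1.0 if a completion matches the think/answer format, else 0.0.

    dotall is a hypothetical switch; the README calls re.match without flags.
    """
    flags = re.DOTALL if dotall else 0
    return [1.0 if re.match(FORMAT_PATTERN, c, flags) else 0.0 for c in completions]


# Illustrative completions (made up for this sketch).
single_line = "<think>2 + 3 = 5</think><answer>5</answer>"
multi_line = "<think>2 + 3\n= 5</think><answer>5</answer>"

print(reward_len([single_line]))                              # [-8]
print(reward_format([single_line, multi_line]))               # [1.0, 0.0]
print(reward_format([single_line, multi_line], dotall=True))  # [1.0, 1.0]
```

Whether multi-line reasoning should earn the format reward is a design choice; the default here mirrors the README's behavior, and `dotall=True` is only one possible relaxation.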