---
library_name: transformers
license: mit
datasets:
- eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B
---
|
|
|
# Model Card for `Qwen2.5-0.5B` Fine-Tuned on Enhanced GSM8K
|
|
|
This model is a fine-tuned version of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) designed to solve math word problems with improved reasoning structure and response format. Fine-tuning was performed using **LoRA (Low-Rank Adaptation)** for parameter-efficient training and the **GRPO (Group Relative Policy Optimization)** algorithm to optimize outputs against reward signals.
|
|
|
---
|
|
|
## 🧠 Model Objective
|
|
|
The goal of this fine-tuning is to improve performance on short-form math reasoning tasks such as those found in the GSM8K benchmark. The model is trained to produce outputs in a specific format:
|
|
|
```text
<think>...reasoning steps...</think><answer>...final answer...</answer>
```
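
A minimal inference sketch showing how to prompt the model and parse this format. The repo id below is a hypothetical placeholder (this card does not name the published checkpoint), so substitute the actual model id:

```python
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- replace with the actual checkpoint location.
model_id = "your-username/qwen2.5-0.5b-gsm8k-grpo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=96)
text = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Extract the final answer from the <answer>...</answer> span, if present.
match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
print(match.group(1).strip() if match else text)
```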
|
|
|
---
|
|
|
## 📊 Dataset
|
|
|
Training was done on:

- [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)

This version enhances the original GSM8K dataset with structured reasoning chains and answer tokens for alignment and reward-based learning.
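
A quick way to inspect the data (a sketch; the `train` split name is an assumption here, so check the dataset card for the exact schema):

```python
from datasets import load_dataset

# Load the enhanced GSM8K dataset from the Hugging Face Hub.
ds = load_dataset(
    "eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1"
)

print(ds)             # available splits and column names
print(ds["train"][0]) # one example, assuming a "train" split exists
```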
|
|
|
---
|
|
|
## 🛠️ Training Methodology
|
|
|
### 🔗 LoRA Configuration
|
|
|
The model was fine-tuned using LoRA on the attention projection layers (`q_proj`, `v_proj`) for parameter-efficient adaptation:
|
|
|
```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model before wrapping it with LoRA adapters.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    lora_dropout=0.1,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirm only a small fraction is trainable
```
|
|
|
---
|
|
|
### 🎯 Reward Functions for GRPO
|
|
|
We used two custom reward functions to guide generation during training:

- `reward_len`: encourages completions close to a target length of 50 characters (note that `len(completion)` counts characters, not tokens), nudging the model toward concise reasoning.
|
|
|
```python
def reward_len(completions, **kwargs):
    # Negative absolute deviation from the 50-character target length.
    return [-abs(50 - len(completion)) for completion in completions]
```
|
|
|
- `reward_format`: enforces formatting in the style `<think>...</think><answer>...</answer>`
|
|
|
```python
import re

def reward_format(completions, **kwargs):
    # Reward 1.0 only when the completion matches the expected format exactly.
    # Note: without re.DOTALL, ".*?" does not match newlines, so multi-line
    # reasoning inside <think>...</think> receives zero reward.
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c) else 0.0 for c in completions]
```
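
A quick sanity check of both reward functions on toy completions (a minimal sketch; the example strings are made up for illustration):

```python
good = "<think>48 + 24 = 72</think><answer>72</answer>"
bad = "The answer is 72."

print(reward_len([good, bad]))     # [-4, -33]: closer to 50 characters is better
print(reward_format([good, bad]))  # [1.0, 0.0]: only the formatted output scores
```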
|
|
|
---
|
|
|
### ⚙️ GRPO Training Configuration
|
|
|
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_prompt_length=512,        # prompts are truncated to 512 tokens
    max_completion_length=96,     # completions are capped at 96 tokens
    num_generations=8,            # candidate completions sampled per prompt
    optim="adamw_8bit",
    num_train_epochs=1,
    bf16=True,
    report_to="none",
    remove_unused_columns=False,  # keep extra columns for the reward functions
    logging_steps=1,
)
```
|
|
|
- **Mixed precision training**: enabled (`bf16=True`)
- **Optimizer**: `adamw_8bit` for memory-efficient optimization
- **Sampling**: 8 candidate generations per prompt (`num_generations=8`)
- **Reward-guided selection**: candidates are scored with `reward_len` and `reward_format`, and their group-relative advantages drive the policy update (see the sketch below)
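
Putting the pieces together, a minimal training sketch using TRL's `GRPOTrainer` (assuming `model`, the reward functions, and the loaded dataset `ds` are defined as in the snippets above):

```python
from trl import GRPOTrainer

# GRPOTrainer expects the training dataset to expose a "prompt" column;
# the "train" split name is an assumption based on the dataset card.
trainer = GRPOTrainer(
    model=model,
    args=training_args,
    reward_funcs=[reward_len, reward_format],
    train_dataset=ds["train"],
)
trainer.train()
```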
|
|
|
---
|
|
|
## 💡 Use Cases
|
|
|
- Educational tutoring systems for math reasoning
- Automated math assistants and solvers
- Research on structured reasoning and format-aware generation
|
|
|
---
|
|
|
## 🧪 Limitations
|
|
|
This model is fine-tuned on a specialized set of math word problems and may not generalize to other reasoning tasks. It expects prompts phrased as math word problems and produces outputs in a strict reasoning/answer format.
|
|
|
---
|
|
|
## 📄 License
|
|
|
This model is released under the **MIT License**.
|
|