---
library_name: transformers
license: mit
datasets:
- eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
language:
- en
base_model:
- Qwen/Qwen2.5-0.5B
---

# Model Card for `Qwen2.5-0.5B` Fine-Tuned on Enhanced GSM8K

This model is a fine-tuned version of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) designed to solve math word problems with improved reasoning structure and response format. Fine-tuning was performed using **LoRA (Low-Rank Adaptation)** for parameter-efficient training and the **GRPO (Group Relative Policy Optimization)** algorithm to optimize outputs against reward signals.

---

## 🧠 Model Objective

The goal of this fine-tuning run is to improve performance on short-form math reasoning tasks such as those found in the GSM8K benchmark. The model has been encouraged to produce outputs in a specific format:

```text
<reasoning>...reasoning steps...</reasoning><answer>...final answer...</answer>
```

---

## 📚 Dataset

Training was performed on:

- [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)

This version enhances the original GSM8K dataset with structured reasoning chains and answer tags for alignment and reward-based learning.

---

## 🛠️ Training Methodology

### 🔄 LoRA Configuration

The model was fine-tuned using LoRA on the attention projection layers (`q_proj`, `v_proj`) for parameter-efficient adaptation:

```python
from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the LoRA updates
    target_modules=["q_proj", "v_proj"],  # only query/value projections are adapted
    lora_dropout=0.1,
    bias="none",
)

model = get_peft_model(model, lora_config)
```

---

### 🎯 Reward Functions for GRPO

We used two custom reward functions to guide generation during training:

- `reward_len`: Penalizes completions in proportion to how far their character length deviates from 50, discouraging both overly terse and rambling reasoning.

```python
def reward_len(completions, **kwargs):
    # Reward peaks at 0 when a completion is exactly 50 characters long.
    return [-abs(50 - len(completion)) for completion in completions]
```

- `reward_format`: Enforces the `<reasoning>...</reasoning><answer>...</answer>` output format.

```python
import re

def reward_format(completions, **kwargs):
    # re.DOTALL lets "." span newlines so multi-line reasoning still matches.
    pattern = r"^<reasoning>.*?</reasoning><answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]
```

---

### ⚙️ GRPO Training Configuration

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_prompt_length=512,
    max_completion_length=96,
    num_generations=8,          # group size: candidate completions sampled per prompt
    optim="adamw_8bit",
    num_train_epochs=1,
    bf16=True,
    report_to="none",
    remove_unused_columns=False,
    logging_steps=1,
)
```

- **Mixed precision training**: Enabled (`bf16=True`)
- **Optimizer**: `adamw_8bit` for memory-efficient optimization
- **Sampling**: 8 candidate generations per prompt (`num_generations=8`)
- **Reward-guided updates**: Each group of candidates is scored with `reward_len` and `reward_format`, and the group-relative advantages drive the policy update.

An illustrative sketch of how the LoRA-wrapped model, the reward functions, and this configuration fit together in TRL's `GRPOTrainer` is included at the end of this card.

---

## 💡 Use Cases

- Educational tutoring systems for math reasoning
- Automated math assistants and solvers
- Research on structured reasoning and format-aware generation

An example inference snippet is provided at the end of this card.

---

## 🧪 Limitations

This model is fine-tuned on a specialized set of math word problems and may not generalize to other reasoning tasks. It expects prompts phrased as math word problems and produces outputs in a strict reasoning/answer format.

---

## 📜 License

This model is released under the **MIT License**.
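
---

## 🔧 End-to-End Training Sketch (Illustrative)

The snippet below shows one way the components described above can be wired together with TRL's `GRPOTrainer`. It is a minimal sketch rather than the exact training script: the `split="train"` argument and the assumption that the dataset exposes a `prompt` column are guesses, and the `processing_class` keyword reflects recent TRL versions (older releases name this argument differently). It reuses `lora_config`, `reward_len`, `reward_format`, and `training_args` as defined earlier on this card.

```python
from datasets import load_dataset
from peft import get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOTrainer

model_name = "Qwen/Qwen2.5-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap the base model with the LoRA adapters configured above.
model = get_peft_model(model, lora_config)

# Assumption: the dataset provides a "prompt" column, which GRPOTrainer expects.
dataset = load_dataset(
    "eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1",
    split="train",
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_len, reward_format],  # custom rewards defined above
    args=training_args,                        # the GRPOConfig defined above
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Depending on the TRL version, the LoRA setup can also be handed to `GRPOTrainer` through its `peft_config` argument instead of wrapping the model manually with `get_peft_model`.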
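
---

## 🚀 Example Inference (Illustrative)

A minimal inference sketch, assuming the trained LoRA adapter is loaded on top of the base model with PEFT. The adapter path `"path/to/grpo-lora-adapter"` is a placeholder rather than an actual repository ID, the example question is taken from GSM8K, and the generation settings are illustrative.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)

# Placeholder: point this at the fine-tuned LoRA adapter (local directory or Hub repo).
model = PeftModel.from_pretrained(base_model, "path/to/grpo-lora-adapter")
model.eval()

prompt = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=96)  # matches max_completion_length

# Strip the prompt tokens and decode only the completion, which should follow
# the <reasoning>...</reasoning><answer>...</answer> format.
completion = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
```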