Gemma-3 1B IT LoRA Fine-tuned with GRPO

This repository contains a LoRA (Low-Rank Adaptation) fine-tuned version of Google's Gemma-3 1B IT model, optimized with Group Relative Policy Optimization (GRPO) for improved instruction-following and reasoning.

Model Description

  • Base Model: google/gemma-3-1b-it - A 1 billion parameter instruction-tuned language model from Google.
  • Fine-tuning Method: LoRA combined with GRPO (Group Relative Policy Optimization).
  • Primary Task: Instruction tuning with reinforcement learning, focusing on mathematical reasoning and problem-solving.
  • Model Size: ~1B base parameters plus lightweight LoRA adapter weights.

GRPO is a reinforcement learning algorithm, introduced in the DeepSeekMath paper cited below, that optimizes a policy directly for task-specific rewards without a separate critic (value) model: for each prompt, a group of completions is sampled and each completion's reward is compared against the group average. This makes training more memory-efficient while keeping the model aligned with the desired behavior.
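
As an illustration of the group-relative idea (a minimal sketch, not the actual training code used for this model), each sampled completion's advantage can be computed by normalizing its reward against the other completions drawn for the same prompt, which is what removes the need for a critic:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize per-completion rewards within the group sampled for one prompt.

    rewards: shape (group_size,), one scalar reward per sampled completion.
    The normalized values play the role that a critic's value estimates would in PPO.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four completions for the same prompt, scored by a reward function
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(group_relative_advantages(rewards))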

Intended Use

This fine-tuned model is designed for:

  • Answering mathematical and reasoning questions
  • Following complex instructions
  • Generating coherent and accurate responses
  • Educational and tutoring applications
  • Research in reinforcement learning for language models

Note: This model should be used responsibly and outputs should be verified for accuracy, especially in critical applications.

How to Use

Installation

First, install the required dependencies:

pip install transformers peft torch accelerate

Loading the Model

To use this fine-tuned model, load the base model and apply the LoRA adapters:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Model identifiers
base_model_name = "google/gemma-3-1b-it"
adapter_repo_id = "Miracle12345/gemma-3-GRPO"  

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Load and apply LoRA adapters
model = PeftModel.from_pretrained(model, adapter_repo_id)

# Example inference
prompt = "Solve: What is 15 + 27?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
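
Since the base model is instruction-tuned, you will generally get better results by formatting the input with the tokenizer's chat template rather than passing raw text. A minimal sketch using the standard transformers chat-template API (the prompt itself is just an example):

# Build a chat-formatted prompt for the instruction-tuned model
messages = [
    {"role": "user", "content": "Solve: What is 15 + 27?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))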

Merging LoRA Weights (Optional)

For faster inference without the LoRA overhead:

# Merge LoRA weights into the base model and save a standalone copy
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_gemma_3_grpo")
tokenizer.save_pretrained("merged_gemma_3_grpo")

# Run inference with the merged model
inputs = tokenizer("Your prompt", return_tensors="pt").to(merged_model.device)
outputs = merged_model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
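
Once saved, the merged checkpoint can be reloaded later like any regular transformers model (the local path below is the one used in the snippet above):

from transformers import AutoModelForCausalLM, AutoTokenizer

reloaded_model = AutoModelForCausalLM.from_pretrained("merged_gemma_3_grpo", device_map="auto")
reloaded_tokenizer = AutoTokenizer.from_pretrained("merged_gemma_3_grpo")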

Training Details

Training Procedure

The model was fine-tuned using the Unsloth framework with the following approach:

  1. Base Model: Gemma-3 1B IT
  2. Fine-tuning Method: LoRA + GRPO
  3. Training Objective: Maximize reward-based metrics for correct reasoning and solution extraction
  4. Prompt Format: Structured prompts with reasoning tags (<start_working_out>, <end_working_out>, <SOLUTION>); an illustrative example is shown below
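
The exact system prompt used during training is not included in this card, so the following is only a hypothetical sketch of the tag format (the closing </SOLUTION> tag is an assumption):

# Hypothetical prompt illustrating the reasoning-tag format; the actual training prompt may differ
system_prompt = (
    "You are given a problem. Work out your reasoning between "
    "<start_working_out> and <end_working_out>, then give the final answer "
    "between <SOLUTION> and </SOLUTION>."
)
user_prompt = "Solve: What is 15 + 27?"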

Hyperparameters

  • Learning Rate: 5e-5
  • Batch Size: 4 (effective batch size with gradient accumulation)
  • Epochs: 3
  • LoRA Configuration (see the configuration sketch after this list):
    • Rank: 16
    • Alpha: 32
    • Target Modules: Query and Value attention layers
  • Max Sequence Length: 4096 tokens
  • GRPO Settings:
    • Reward Function: Accuracy-based with format compliance bonus
    • KL Divergence Penalty: 0.01
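
For reference, a PEFT configuration matching the values above would look roughly like the sketch below. The target module names ("q_proj", "v_proj") and dropout are assumptions, and the reward function is only a toy illustration of an accuracy reward with a format-compliance bonus, not the function used in training:

from peft import LoraConfig

# Sketch of a LoRA configuration matching the listed hyperparameters
# (module names and dropout are assumptions).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

# Toy reward: 1.0 for a correct final answer plus a 0.5 format-compliance bonus.
def reward_fn(completion: str, gold_answer: str) -> float:
    reward = 0.0
    if "<start_working_out>" in completion and "<SOLUTION>" in completion:
        reward += 0.5  # format compliance bonus
    final_answer = completion.split("<SOLUTION>")[-1]
    if gold_answer.strip() in final_answer:
        reward += 1.0  # accuracy reward
    return reward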

Limitations

  • Domain Specificity: Best performance on mathematical and reasoning tasks; may not generalize well to other domains
  • Context Length: Limited to 4096 tokens due to training constraints
  • Numerical Precision: May occasionally produce incorrect calculations for very large numbers
  • Language: Primarily trained on English text
  • Hallucinations: Like all language models, can generate incorrect information
  • Bias: May reflect biases present in the training data

Citation

If you use this model in your research or applications, please cite the paper that introduced GRPO:

@article{shao2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open-source Large Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}

License

The LoRA adapter weights in this repository are released under the Apache License 2.0. The base Gemma-3 model remains subject to Google's own licensing terms (the Gemma Terms of Use).
