Llama 3.1 8B Instruct - GRPO Fine-Tuned for Reasoning
Model Overview
- Developed by: colesmcintosh
- License: Apache-2.0
- Base Model: unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit
- Purpose: Enhanced reasoning capabilities for complex tasks
This is a fine-tuned version of Meta's Llama 3.1 8B Instruct model, optimized for reasoning using Group Relative Policy Optimization (GRPO). It leverages Unsloth for 2x faster training and Hugging Face's TRL library for efficient fine-tuning. The model is designed for tasks requiring detailed reasoning, such as mathematical problem-solving, chain-of-thought question answering, and multi-step logical analysis.
Base Model Details
The base model, unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit, is an 8-billion-parameter Llama 3.1 model instruction-tuned for instruction following. It ships with 4-bit quantization via bitsandbytes (bnb) and Unsloth optimizations, reducing memory usage by roughly 70%, and the Llama 3.1 architecture supports context lengths up to 128K tokens. The architecture is an auto-regressive transformer with Grouped-Query Attention (GQA), making it efficient for conversational and reasoning tasks.
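As a rough sketch of what loading this 4-bit base looks like through Unsloth (the max_seq_length value is an illustrative choice, not a fixed requirement):

```python
from unsloth import FastLanguageModel

# Load the 4-bit bitsandbytes base model; load_in_4bit activates the
# quantization the repo ships with. max_seq_length here is illustrative.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
```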
Fine-Tuning Process
Method: GRPO (Group Relative Policy Optimization)
This model was fine-tuned using GRPO, a reinforcement learning technique that enhances reasoning by sampling a group of candidate responses per prompt, scoring each with a reward function (e.g., +1 for correct reasoning, -0.1 for errors), and reinforcing the responses that score above the group average. GRPO trains the model to produce detailed reasoning traces, improving its ability to handle complex, multi-step problems.
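As a rough illustration of that reward shaping (the function name, the answer dataset column, and the exact scoring rule are assumptions for this sketch, not the card's actual reward code), a TRL-style GRPO reward function is just a callable that scores a batch of completions:

```python
# Hypothetical reward function in the shape TRL's GRPOTrainer expects:
# it receives the generated completions plus dataset columns as kwargs
# and returns one float score per completion.
def correctness_reward(completions, answer, **kwargs):
    # With conversational datasets, each completion is a list of chat
    # messages; pull out the generated text.
    responses = [c[0]["content"] for c in completions]
    # +1 when the reference answer appears in the response, -0.1 otherwise,
    # mirroring the example scores mentioned above.
    return [1.0 if a in r else -0.1 for r, a in zip(responses, answer)]
```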
Tools and Optimization
- Unsloth: Enabled 2x faster training and reduced VRAM requirements to 5GB (as of February 2025), down from 7GB, making it trainable on a single GPU (e.g., Tesla T4).
- TRL Library: Facilitated GRPO implementation and training stability.
- Adapters: Used QLoRA for efficient parameter updates, maintaining performance with lower resource demands (see the training sketch after this list).
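A minimal end-to-end sketch of how these pieces fit together, assuming the reward function above, a prompt dataset with an answer column, and illustrative hyperparameters (not the card's exact training configuration):

```python
from trl import GRPOConfig, GRPOTrainer
from unsloth import FastLanguageModel

# Attach QLoRA adapters to the 4-bit model loaded earlier; rank and
# target modules are illustrative defaults, not the card's settings.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)

# GRPOTrainer samples a group of completions per prompt, scores them
# with the reward function(s), and reinforces above-average responses.
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[correctness_reward],
    args=GRPOConfig(output_dir="grpo-llama31-8b", num_generations=8),
    train_dataset=train_dataset,  # assumed: prompts plus an "answer" column
)
trainer.train()
```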
Usage
To use this model, load it via Hugging Face's Transformers library (Unsloth can also load it for faster 4-bit inference):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained("colesmcintosh/llama-3.1-8b-grpo-reasoning")
tokenizer = AutoTokenizer.from_pretrained("colesmcintosh/llama-3.1-8b-grpo-reasoning")
```
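A quick inference sketch follows; the prompt and generation settings are illustrative, not prescribed by the model card:

```python
# Format a chat prompt and generate a reasoning trace.
messages = [{"role": "user", "content": "A train covers 60 km in 45 minutes. What is its average speed in km/h?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```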
Check out Unsloth's Llama-3.1 8B GRPO Notebook.