Llama 3.1-8B Fine-tuned with GRPO

Model Name: yuxiang204/llama3-8b-finetuned
Base Model: meta-llama/Meta-Llama-3.1-8B
Fine-tuned with: Unsloth + GRPO (Group Relative Policy Optimization)
Quantization: Available in FP16, Q4_K_M, Q5_K_M, and Q8_0 (GGUF)
License: MIT

📌 Model Overview

This is a fine-tuned version of Meta's Llama 3.1-8B, trained with GRPO using the Unsloth framework. The fine-tuning process focused on enhancing structured reasoning and improving response quality.

It includes:

  • FP16 Safetensors for Hugging Face Transformers (see the loading sketch below)
  • GGUF quantized versions for fast inference in llama.cpp, Ollama, and KoboldAI (a quantized-inference sketch follows the performance comparison)
  • LoRA adapters for further fine-tuning
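
For a quick sanity check, the FP16 weights load like any other Llama 3.1 checkpoint with Transformers. The snippet below is a minimal sketch: the repo id comes from the header above, the prompt is illustrative, and plain-text prompting is used in case the repository does not ship a chat template.

```python
# Minimal sketch: load the FP16 safetensors with Hugging Face Transformers.
# The prompt below is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yuxiang204/llama3-8b-finetuned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # matches the published F16 tensors
    device_map="auto",
)

prompt = "Which is bigger, 9.11 or 9.9? Think step by step.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The bundled LoRA adapters can likewise be attached to the base model with PEFT (e.g. PeftModel.from_pretrained) if you want to continue fine-tuning rather than use the merged weights.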

🛠 Training Details

  • Fine-tuning Method: GRPO (Group Relative Policy Optimization)
  • Training Duration: ~10 hours
  • Dataset: Custom instruction dataset (mainly reasoning-based tasks)
  • GPU Used: A100 (80GB)

The fine-tuning aimed at improving logical reasoning, mathematical accuracy, and structured responses.
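
The exact dataset, reward functions, and hyperparameters are not published in this card, so the following is only a minimal sketch of the usual Unsloth + TRL GRPOTrainer recipe that a run like the one described above would follow. Every concrete value below (dataset file, reward function, LoRA rank, learning rate, step count) is a placeholder, not the setting actually used.

```python
# Hypothetical Unsloth + TRL GRPO fine-tuning sketch.
# Dataset, reward function, and hyperparameters are placeholders,
# NOT the settings used to train this model.
from unsloth import FastLanguageModel  # import unsloth before trl/transformers
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    load_in_4bit=True,                 # QLoRA-style loading to fit a single GPU
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder reasoning dataset with a "prompt" column.
dataset = load_dataset("json", data_files="reasoning_prompts.jsonl", split="train")

def structure_reward(completions, **kwargs):
    """Toy reward: favors longer, non-trivial answers (illustrative only)."""
    return [min(len(c) / 200.0, 1.0) for c in completions]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[structure_reward],
    args=GRPOConfig(
        output_dir="llama3-8b-grpo",
        per_device_train_batch_size=8,
        num_generations=8,             # group size for the relative-advantage baseline
        max_prompt_length=512,
        max_completion_length=512,
        learning_rate=5e-6,
        max_steps=500,
    ),
    train_dataset=dataset,
)
trainer.train()
```

In GRPO, each prompt is sampled num_generations times and every completion's reward is normalized against its group's mean, which is what removes the need for a separate value model.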

📊 Performance Comparison

Prompt: “Which is bigger? 9.11 or 9.9?”

Before Fine-tuning (Base Model):

→ Inconsistent or incomplete reasoning

After GRPO Fine-tuning:

→ “Both are not equal! Since 9.11 has a slightly larger decimal part than 9.9, 9.11 is actually bigger.”

→ More structured and detailed response
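
To try the same comparison locally with one of the GGUF quantizations, the file can be run through llama-cpp-python (or through llama.cpp/Ollama directly). This is a sketch; the filename is a placeholder for whichever quantization you download from this repository.

```python
# Sketch: run a Q4_K_M GGUF quantization with llama-cpp-python.
# The model_path filename is a placeholder for the file downloaded from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="llama3-8b-finetuned.Q4_K_M.gguf",  # placeholder filename
    n_ctx=2048,
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm(
    "Which is bigger, 9.11 or 9.9? Explain step by step.",
    max_tokens=256,
    temperature=0.0,
)
print(out["choices"][0]["text"])
```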
