Llama 3.1-8B Fine-tuned with GRPO

Model Name: yuxiang204/llama3-8b-finetuned
Base Model: meta-llama/Meta-Llama-3.1-8B
Fine-tuned with: Unsloth + GRPO (Group Relative Policy Optimization)
Quantization: Available in FP16, Q4_K_M, Q5_K_M, and Q8_0 (GGUF)
License: MIT

📌 Model Overview

This is a fine-tuned version of Meta's Llama 3.1-8B, trained with GRPO using the Unsloth framework. The fine-tuning process focused on enhancing structured reasoning and improving response quality.

It includes:

  • FP16 Safetensors for Hugging Face Transformers (see the loading sketch below)
  • GGUF quantized versions for fast inference in llama.cpp, Ollama, and KoboldAI (a quantized-inference sketch follows the performance comparison)
  • LoRA adapters for further fine-tuning
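
For a quick sanity check, the FP16 weights load like any other Llama 3.1 checkpoint with Transformers. The snippet below is a minimal sketch: the repo id comes from the header above, the prompt is illustrative, and plain-text prompting is used in case the repository does not ship a chat template.

```python
# Minimal sketch: load the FP16 safetensors with Hugging Face Transformers.
# The prompt below is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yuxiang204/llama3-8b-finetuned"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # matches the published F16 tensors
    device_map="auto",
)

prompt = "Which is bigger, 9.11 or 9.9? Think step by step.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The bundled LoRA adapters can likewise be attached to the base model with PEFT (e.g. PeftModel.from_pretrained) if you want to continue fine-tuning rather than use the merged weights.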

🛠 Training Details

  • Fine-tuning Method: GRPO (Group Relative Policy Optimization)
  • Training Duration: ~10 hours
  • Dataset: Custom instruction dataset (mainly reasoning-based tasks)
  • GPU Used: A100 (80GB)

The fine-tuning aimed at improving logical reasoning, mathematical accuracy, and structured responses.
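
The exact dataset, reward functions, and hyperparameters are not published in this card, so the following is only a minimal sketch of the usual Unsloth + TRL GRPOTrainer recipe that a run like the one described above would follow. Every concrete value below (dataset file, reward function, LoRA rank, learning rate, step count) is a placeholder, not the setting actually used.

```python
# Hypothetical Unsloth + TRL GRPO fine-tuning sketch.
# Dataset, reward function, and hyperparameters are placeholders,
# NOT the settings used to train this model.
from unsloth import FastLanguageModel  # import unsloth before trl/transformers
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    load_in_4bit=True,                 # QLoRA-style loading to fit a single GPU
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder reasoning dataset with a "prompt" column.
dataset = load_dataset("json", data_files="reasoning_prompts.jsonl", split="train")

def structure_reward(completions, **kwargs):
    """Toy reward: favors longer, non-trivial answers (illustrative only)."""
    return [min(len(c) / 200.0, 1.0) for c in completions]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[structure_reward],
    args=GRPOConfig(
        output_dir="llama3-8b-grpo",
        per_device_train_batch_size=8,
        num_generations=8,             # group size for the relative-advantage baseline
        max_prompt_length=512,
        max_completion_length=512,
        learning_rate=5e-6,
        max_steps=500,
    ),
    train_dataset=dataset,
)
trainer.train()
```

In GRPO, each prompt is sampled num_generations times and every completion's reward is normalized against its group's mean, which is what removes the need for a separate value model.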

📊 Performance Comparison

Prompt: “Which is bigger? 9.11 or 9.9?”

Before Fine-tuning (Base Model):

→ Inconsistent or incomplete reasoning

After GRPO Fine-tuning:

→ “Both are not equal! Since 9.11 has a slightly larger decimal part than 9.9, 9.11 is actually bigger.”

→ More structured and detailed response
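
To try the same comparison locally with one of the GGUF quantizations, the file can be run through llama-cpp-python (or through llama.cpp/Ollama directly). This is a sketch; the filename is a placeholder for whichever quantization you download from this repository.

```python
# Sketch: run a Q4_K_M GGUF quantization with llama-cpp-python.
# The model_path filename is a placeholder for the file downloaded from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="llama3-8b-finetuned.Q4_K_M.gguf",  # placeholder filename
    n_ctx=2048,
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm(
    "Which is bigger, 9.11 or 9.9? Explain step by step.",
    max_tokens=256,
    temperature=0.0,
)
print(out["choices"][0]["text"])
```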
