---
base_model: google/gemma-2-9b-it
library_name: transformers
license: gemma
pipeline_tag: text-generation
tags:
- conversational
- llama-cpp
- matrixportal
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: >-
  To access Gemma on Hugging Face, you’re required to review and agree to
  Google’s usage license. To do this, please ensure you’re logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
---
# matrixportal/gemma-2-9b-it-GGUF

This model was converted to GGUF format from [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it) using llama.cpp via ggml.ai's all-gguf-same-where space.
Refer to the [original model card](https://huggingface.co/google/gemma-2-9b-it) for more details on the model.
## ✅ Quantized Models Download List

### 🔍 Recommended Quantizations
- ✨ **General CPU Use:** [Q4_K_M](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf) (Best balance of speed/quality)
- 📱 **ARM Devices:** [Q4_0](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_0.gguf) (Optimized for ARM CPUs)
- 🏆 **Maximum Quality:** [Q8_0](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q8_0.gguf) (Near-original quality)
### 📦 Full Quantization Options

| 🚀 Download | 🔢 Type | 📝 Notes |
|---|---|---|
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q2_k.gguf) | Q2_K | Basic quantization |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q3_k_s.gguf) | Q3_K_S | Small size |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q3_k_m.gguf) | Q3_K_M | Balanced quality |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q3_k_l.gguf) | Q3_K_L | Better quality |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_0.gguf) | Q4_0 | Fast on ARM |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_s.gguf) | Q4_K_S | Fast, recommended |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf) | Q4_K_M | Best balance |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q5_0.gguf) | Q5_0 | Good quality |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q5_k_s.gguf) | Q5_K_S | Balanced |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q5_k_m.gguf) | Q5_K_M | High quality |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q6_k.gguf) | Q6_K | Very good quality |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q8_0.gguf) | Q8_0 | Fast, best quality |
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-f16.gguf) | F16 | Maximum accuracy |
💡 **Tip:** Use F16 for maximum precision when quality is critical.
# GGUF Model Quantization & Usage Guide with llama.cpp

## What is GGUF and Quantization?

GGUF (GPT-Generated Unified Format) is an efficient model file format developed by the llama.cpp team that:
- Supports multiple quantization levels
- Works cross-platform
- Enables fast loading and inference
Quantization converts model weights to lower-precision data types (e.g., 4-bit integers instead of 32-bit floats) in order to:
- Reduce model size
- Decrease memory usage
- Speed up inference
- …at the cost of only minor accuracy trade-offs (see the rough size estimate below)
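
As a back-of-the-envelope illustration of the size impact, here is a small sketch for a ~9B-parameter model at a few precisions. The bits-per-weight figures are approximate averages, and real GGUF files mix tensor precisions, so actual file sizes will differ somewhat:

```python
# Rough, illustrative size estimate for a ~9B-parameter model.
# Bits-per-weight values are approximate averages for each format.
params = 9e9
for name, bits_per_weight in [("F32", 32), ("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    size_gb = params * bits_per_weight / 8 / 1e9
    print(f"{name:>7}: ~{size_gb:.1f} GB")
```

This is why a 4-bit quantization of a 9B model can fit in the memory of a typical consumer machine, while the full-precision weights cannot.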
## Step-by-Step Guide

### 1. Prerequisites

```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4
```
### 2. Using Quantized Models from Hugging Face

My automated quantization script produces models in this format:

```
https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
```

Download your quantized model directly:

```bash
wget https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
```
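
If you prefer Python, the same file can be fetched with the `huggingface_hub` library. A minimal sketch, assuming `pip install huggingface_hub` and downloading into the current directory:

```python
from huggingface_hub import hf_hub_download

# Downloads (and caches) the GGUF file; returns the local file path.
model_path = hf_hub_download(
    repo_id="matrixportal/gemma-2-9b-it-GGUF",
    filename="gemma-2-9b-it-q4_k_m.gguf",
    local_dir=".",
)
print(model_path)
```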
### 3. Running the Quantized Model

Basic usage (recent llama.cpp builds name this binary `llama-cli` instead of `main`):

```bash
./main -m gemma-2-9b-it-q4_k_m.gguf -p "Your prompt here" -n 128
```

Example with a creative writing prompt (note that Gemma 2 uses `<start_of_turn>` chat markers rather than `[INST]` tags):

```bash
./main -m gemma-2-9b-it-q4_k_m.gguf -p "<start_of_turn>user
Write a short poem about AI quantization in the style of Shakespeare<end_of_turn>
<start_of_turn>model
" -n 256 -c 2048 -t 8 --temp 0.7
```

Advanced parameters:

```bash
./main -m gemma-2-9b-it-q4_k_m.gguf -p "Question: What is the GGUF format?
Answer:" -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```
### 4. Python Integration

Install the Python package:

```bash
pip install llama-cpp-python
```

Example script:

```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="gemma-2-9b-it-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8
)

# Run inference (the prompt uses Gemma 2's chat markers)
response = llm(
    "<start_of_turn>user\nExplain GGUF quantization to a beginner<end_of_turn>\n<start_of_turn>model\n",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```
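
Alternatively, `llama-cpp-python` can apply the chat template stored inside the GGUF file for you, so you don't have to write the Gemma turn markers by hand. A minimal sketch using the same model file:

```python
from llama_cpp import Llama

llm = Llama(model_path="gemma-2-9b-it-q4_k_m.gguf", n_ctx=2048, n_threads=8)

# create_chat_completion formats the conversation with the model's own chat template
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization to a beginner"}],
    max_tokens=256,
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])
```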
## Performance Tips

**Hardware Utilization:**
- Set the thread count with `-t` (typically your physical CPU core count)
- Compile with CUDA/OpenCL support to use the GPU

**Memory Optimization:**
- Lower-bit quantizations (like q4_k_m) use less RAM
- Adjust the context size with the `-c` parameter

**Speed/Accuracy Balance:**
- Higher-bit quantization is slower but more accurate
- Reduce randomness with `--temp 0` for consistent results (a combined example is sketched below)
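
Putting these together, here is a sketch of how the same knobs look in `llama-cpp-python`. The parameter values are only starting points, and `n_gpu_layers` has an effect only if the package was built with GPU support:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-9b-it-q4_k_m.gguf",
    n_ctx=2048,        # context size: larger contexts need more RAM
    n_threads=8,       # roughly the number of physical CPU cores
    n_gpu_layers=-1,   # offload all layers to the GPU if available; 0 = CPU only
)

# temperature=0.0 makes the output (near-)deterministic for repeatable runs
out = llm("Question: What is the GGUF format?\nAnswer:", max_tokens=64, temperature=0.0)
print(out["choices"][0]["text"])
```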
## FAQ

**Q: What quantization levels are available?**
A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, and q8_0 (see the full table above).
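
If you want to check programmatically which quantizations this repository actually ships, a small sketch with `huggingface_hub` (assumes the package is installed):

```python
from huggingface_hub import list_repo_files

# The quantization type is encoded in each .gguf filename (e.g. ...-q4_k_m.gguf)
gguf_files = [f for f in list_repo_files("matrixportal/gemma-2-9b-it-GGUF") if f.endswith(".gguf")]
print("\n".join(gguf_files))
```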
**Q: How much performance loss occurs with q4_k_m?**
A: Typically a 2-5% accuracy reduction, but roughly 4x smaller than the full-precision model.

**Q: How do I enable GPU support?**
A: Build with `make LLAMA_CUBLAS=1` for NVIDIA GPUs (newer llama.cpp releases use `LLAMA_CUDA=1`, or `-DGGML_CUDA=ON` with CMake).