This is a Q6_K quantization of google/gemma-2-9b-it with Q8_0 output and embedding weights, produced with llama.cpp release b4617.
Model Quantization Guide for Q6_K_L with Q8_0 Output/Embedding
Requirements
- llama.cpp source code
- The model weights from Hugging Face
- Python 3.x with pip
- C++ build tools
Setup
- Clone the repositories:

      git clone https://github.com/ggerganov/llama.cpp
      git clone https://huggingface.co/{MODEL_REPO}
- Compile llama.cpp following the CPU build instructions in the llama.cpp repository (a minimal example is sketched below)
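A minimal CPU-only build sketch, assuming CMake and a C++ toolchain are installed (see llama.cpp's own build documentation for platform-specific options):

```bash
cd llama.cpp
cmake -B build                        # configure a default CPU build
cmake --build build --config Release  # binaries end up in build/bin
cd ..
```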
Conversion to GGUF
- Install the Python dependencies:

      pip install -r llama.cpp/requirements.txt

- Convert the weights to an f32 GGUF file:

      python llama.cpp/convert_hf_to_gguf.py \
          /path/to/model \
          --outtype f32 \
          --outfile /output/path/model-f32.gguf
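As a purely illustrative example (the local directory and file names below are assumptions, not part of the original instructions), converting the gemma-2-9b-it checkout from the Setup step might look like this:

```bash
# Assumes the model repository was cloned to ./gemma-2-9b-it
# next to the llama.cpp checkout.
python llama.cpp/convert_hf_to_gguf.py \
  ./gemma-2-9b-it \
  --outtype f32 \
  --outfile ./gemma-2-9b-it-f32.gguf
```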
Quantization
From the `llama.cpp/build/bin` directory:

    ./llama-quantize \
        --output-tensor-type Q8_0 \
        --token-embedding-type Q8_0 \
        /input/path/model-f32.gguf \
        /output/path/model-Q6_K_L.gguf \
        Q6_K
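Continuing the illustrative gemma-2-9b-it example (paths and file names are assumptions), the same step can also be run with the full path to the binary instead of changing into build/bin:

```bash
# Run from the directory that contains both llama.cpp/ and the f32 GGUF file.
./llama.cpp/build/bin/llama-quantize \
  --output-tensor-type Q8_0 \
  --token-embedding-type Q8_0 \
  ./gemma-2-9b-it-f32.gguf \
  ./gemma-2-9b-it-Q6_K_L.gguf \
  Q6_K
```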
Usage
The quantized model (`model-Q6_K_L.gguf`) can be used with:
- llama.cpp's CLI (llama-cli)
- llama.cpp's server (llama-server), which provides a web UI and an HTTP API
- Other GGUF-compatible tools
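For example, a quick interactive test or a local server could look like this (the model path is carried over from the illustrative example above and is an assumption, not a required layout):

```bash
# Interactive chat in the terminal (-cnv enables conversation mode)
./llama.cpp/build/bin/llama-cli -m ./gemma-2-9b-it-Q6_K_L.gguf -cnv

# Or serve the model; the web UI and HTTP API are exposed on the chosen port
./llama.cpp/build/bin/llama-server -m ./gemma-2-9b-it-Q6_K_L.gguf --port 8080
```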
Notes
- Replace `/path/to/model` with your actual model directory path
- Replace `{MODEL_REPO}` with the Hugging Face repository path of your model
- Quantization parameters:
  - Q6_K: 6-bit quantization for the majority of weights
  - Q8_0: 8-bit quantization for the output and embedding tensors
- This process works for Hugging Face models compatible with llama.cpp conversion
You can use this template by replacing the placeholder values with your specific model information:
- `MODEL_REPO`: your model's Hugging Face repository path
- `model`: your model's name in file paths
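As a sketch of how the template fits together, here is one hypothetical end-to-end script with the two placeholders pulled into variables. The repository value reflects the model named at the top of this card; the directory layout, output file names, and an authenticated clone of the gated Gemma repository are assumptions:

```bash
#!/usr/bin/env bash
set -e

MODEL_REPO="google/gemma-2-9b-it"   # Hugging Face repository path
MODEL="gemma-2-9b-it"               # name used in local file paths

# Fetch the sources and the weights (access to the gated Gemma repo assumed).
git clone https://github.com/ggerganov/llama.cpp
git clone "https://huggingface.co/${MODEL_REPO}" "${MODEL}"

# CPU build of llama.cpp
cmake -S llama.cpp -B llama.cpp/build
cmake --build llama.cpp/build --config Release

# Convert to f32 GGUF, then quantize to Q6_K with Q8_0 output/embedding
pip install -r llama.cpp/requirements.txt
python llama.cpp/convert_hf_to_gguf.py "${MODEL}" \
  --outtype f32 --outfile "${MODEL}-f32.gguf"

./llama.cpp/build/bin/llama-quantize \
  --output-tensor-type Q8_0 --token-embedding-type Q8_0 \
  "${MODEL}-f32.gguf" "${MODEL}-Q6_K_L.gguf" Q6_K
```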