This is a Q6_K quantization of google/gemma-2-9b-it with Q8_0 for the output/embedding weights, produced with llama.cpp version b4617.

Model Quantization Guide for Q6_K_L with Q8_0 Output/Embedding

Requirements

  • llama.cpp source code
  • The model weights from Hugging Face
  • Python 3.x with pip
  • C++ build tools (CMake and a C/C++ compiler; a quick check is sketched below)
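
A quick way to confirm the prerequisites are in place (the tool names are the common ones and only assumptions; on some systems the compiler is clang++ rather than g++, and pip may be pip3):

python3 --version   # any recent Python 3.x
pip --version
git --version
cmake --version     # used for the llama.cpp build
g++ --version       # or clang++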

Setup

  1. Clone repositories:
git clone https://github.com/ggerganov/llama.cpp
git clone https://huggingface.co/{MODEL_REPO}
  2. Compile llama.cpp following the CPU build instructions; a minimal build sketch is shown below
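
A minimal CPU-only build sketch using llama.cpp's standard CMake workflow (the -j value is an assumption; set it to your core count):

cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8
cd ..

The resulting binaries, including llama-quantize, llama-cli, and llama-server, end up in llama.cpp/build/bin, which is the directory used in the quantization step below.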

Conversion to GGUF

  1. Install Python dependencies:
pip install -r llama.cpp/requirements.txt
  2. Convert weights:
python llama.cpp/convert_hf_to_gguf.py \
    /path/to/model \
    --outtype f32 \
    --outfile /output/path/model-f32.gguf
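
Filled in for this model, the command looks roughly as follows (the local directory and file names are assumptions carried over from the clone step above):

python llama.cpp/convert_hf_to_gguf.py \
    ./gemma-2-9b-it \
    --outtype f32 \
    --outfile ./gemma-2-9b-it-f32.gguf

Note that the F32 intermediate file is large: at 4 bytes per parameter, roughly 37 GB for this 9.24B-parameter model.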

Quantization

From the llama.cpp/build/bin directory:

./llama-quantize \
    --output-tensor-type Q8_0 \
    --token-embedding-type Q8_0 \
    /input/path/model-f32.gguf \
    /output/path/model-Q6_K_L.gguf \
    Q6_K
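
For example, assuming the F32 GGUF sits next to the llama.cpp checkout and you are running from llama.cpp/build/bin (the relative paths are assumptions; adjust them to wherever your files actually live):

./llama-quantize \
    --output-tensor-type Q8_0 \
    --token-embedding-type Q8_0 \
    ../../../gemma-2-9b-it-f32.gguf \
    ../../../gemma-2-9b-it-Q6_K_L.gguf \
    Q6_K

The two tensor-type flags are what distinguish Q6_K_L from plain Q6_K: most tensors are quantized to Q6_K, while the output and token-embedding tensors are kept at Q8_0.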

Usage

The quantized model (model-Q6_K_L.gguf) can be used with:

  • llama.cpp's CLI (llama-cli; see the examples below)
  • llama.cpp's server (llama-server), which provides a web UI and an OpenAI-compatible API
  • Other GGUF-compatible tools
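
Two minimal example invocations, run from llama.cpp/build/bin (the prompt, token count, context size, and port are illustrative values, not requirements):

# One-off generation with the CLI
./llama-cli -m /output/path/model-Q6_K_L.gguf -p "Write a haiku about quantization." -n 128

# HTTP server exposing the web UI and an OpenAI-compatible API on port 8080
./llama-server -m /output/path/model-Q6_K_L.gguf -c 4096 --port 8080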

Notes

  1. Replace /path/to/model with your actual model directory path
  2. Replace {MODEL_REPO} with the Hugging Face repository path of your model
  3. Quantization parameters:
    • Q6_K: 6-bit quantization for the majority of weights
    • Q8_0: 8-bit quantization for the output and token-embedding tensors
  4. This process works for any Hugging Face model supported by llama.cpp's conversion script

You can use this template by replacing the placeholder values with your specific model information:

  • MODEL_REPO: Your model's Hugging Face repository path
  • model: Your model's name in file paths