This is a Q6_K quantization of google/gemma-2-9b-it with Q8_0 output and embedding weights, produced with llama.cpp release b4617.
Model Quantization Guide for Q6_K_L with Q8_0 Output/Embedding
Requirements
- llama.cpp source code
- The model weights from Hugging Face
- Python 3.x with pip
- C++ build tools
Setup
- Clone the repositories:

      git clone https://github.com/ggerganov/llama.cpp
      git clone https://huggingface.co/{MODEL_REPO}
- Compile llama.cpp following the CPU build instructions in the llama.cpp repository (a minimal example is sketched below)
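A minimal CPU-only build sketch, assuming CMake and a C++ toolchain are installed (see llama.cpp's own build documentation for platform-specific options):

```bash
cd llama.cpp
cmake -B build                        # configure a default CPU build
cmake --build build --config Release  # binaries end up in build/bin
cd ..
```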
Conversion to GGUF
- Install the Python dependencies:

      pip install -r llama.cpp/requirements.txt

- Convert the weights to an f32 GGUF file:

      python llama.cpp/convert_hf_to_gguf.py \
          /path/to/model \
          --outtype f32 \
          --outfile /output/path/model-f32.gguf
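As a purely illustrative example (the local directory and file names below are assumptions, not part of the original instructions), converting the gemma-2-9b-it checkout from the Setup step might look like this:

```bash
# Assumes the model repository was cloned to ./gemma-2-9b-it
# next to the llama.cpp checkout.
python llama.cpp/convert_hf_to_gguf.py \
  ./gemma-2-9b-it \
  --outtype f32 \
  --outfile ./gemma-2-9b-it-f32.gguf
```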
Quantization
From the `llama.cpp/build/bin` directory:

    ./llama-quantize \
        --output-tensor-type Q8_0 \
        --token-embedding-type Q8_0 \
        /input/path/model-f32.gguf \
        /output/path/model-Q6_K_L.gguf \
        Q6_K
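Continuing the illustrative gemma-2-9b-it example (paths and file names are assumptions), the same step can also be run with the full path to the binary instead of changing into build/bin:

```bash
# Run from the directory that contains both llama.cpp/ and the f32 GGUF file.
./llama.cpp/build/bin/llama-quantize \
  --output-tensor-type Q8_0 \
  --token-embedding-type Q8_0 \
  ./gemma-2-9b-it-f32.gguf \
  ./gemma-2-9b-it-Q6_K_L.gguf \
  Q6_K
```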
Usage
The quantized model (`model-Q6_K_L.gguf`) can be used with:
- llama.cpp's CLI (llama-cli)
- llama.cpp's server (llama-server), which provides a web UI and an HTTP API
- Other GGUF-compatible tools
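For example, a quick interactive test or a local server could look like this (the model path is carried over from the illustrative example above and is an assumption, not a required layout):

```bash
# Interactive chat in the terminal (-cnv enables conversation mode)
./llama.cpp/build/bin/llama-cli -m ./gemma-2-9b-it-Q6_K_L.gguf -cnv

# Or serve the model; the web UI and HTTP API are exposed on the chosen port
./llama.cpp/build/bin/llama-server -m ./gemma-2-9b-it-Q6_K_L.gguf --port 8080
```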
Notes
- Replace `/path/to/model` with your actual model directory path
- Replace `{MODEL_REPO}` with the Hugging Face repository path of your model
- Quantization parameters:
  - Q6_K: 6-bit quantization for the majority of weights
  - Q8_0: 8-bit quantization for the output and embedding tensors
- This process works for Hugging Face models compatible with llama.cpp conversion
You can use this template by replacing the placeholder values with your specific model information:
- `MODEL_REPO`: your model's Hugging Face repository path
- `model`: your model's name in file paths
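As a sketch of how the template fits together, here is one hypothetical end-to-end script with the two placeholders pulled into variables. The repository value reflects the model named at the top of this card; the directory layout, output file names, and an authenticated clone of the gated Gemma repository are assumptions:

```bash
#!/usr/bin/env bash
set -e

MODEL_REPO="google/gemma-2-9b-it"   # Hugging Face repository path
MODEL="gemma-2-9b-it"               # name used in local file paths

# Fetch the sources and the weights (access to the gated Gemma repo assumed).
git clone https://github.com/ggerganov/llama.cpp
git clone "https://huggingface.co/${MODEL_REPO}" "${MODEL}"

# CPU build of llama.cpp
cmake -S llama.cpp -B llama.cpp/build
cmake --build llama.cpp/build --config Release

# Convert to f32 GGUF, then quantize to Q6_K with Q8_0 output/embedding
pip install -r llama.cpp/requirements.txt
python llama.cpp/convert_hf_to_gguf.py "${MODEL}" \
  --outtype f32 --outfile "${MODEL}-f32.gguf"

./llama.cpp/build/bin/llama-quantize \
  --output-tensor-type Q8_0 --token-embedding-type Q8_0 \
  "${MODEL}-f32.gguf" "${MODEL}-Q6_K_L.gguf" Q6_K
```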