@@ -43,3 +43,126 @@ Refer to the [original model card](https://huggingface.co/google/gemma-2-9b-it)
 | [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |
 💡 **Tip:** Use `F16` for maximum precision when quality is critical

+# GGUF Model Quantization & Usage Guide with llama.cpp
+## What is GGUF and Quantization?
+**GGUF** (commonly expanded as GPT-Generated Unified Format) is a single-file binary format for model weights and metadata, developed by the `llama.cpp` project, that:
+- Supports multiple quantization levels
+- Works cross-platform
+- Enables fast loading and inference
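Part of what makes loading fast is that a GGUF file starts with a small fixed header. As a minimal sketch, assuming the v2+ layout from the GGUF specification (4 magic bytes, a little-endian `uint32` version, then two `uint64` counts), the header can be parsed as follows. The helper name `read_gguf_header` and the example values are hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header at the start of a GGUF file (v2 or later)."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic bytes")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic 24-byte header for demonstration (the counts are made up)
example_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(example_header))
```

With a real file, the first 24 bytes read via `open(path, "rb").read(24)` could be passed in instead of the synthetic header.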
+**Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 16- or 32-bit floats). It trades a small amount of accuracy for:
+- Smaller model files
+- Lower memory usage
+- Faster inference
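To make the idea concrete, here is a minimal sketch of symmetric 4-bit quantization of one block of weights. It is a simplified illustration, not the actual k-quant scheme used for `q4_k_m`; the function names and sample weights are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Symmetric quantization: map floats to signed integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -0.05, 0.31, -0.84, 0.45, 0.02]
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("scale:", scale, "max error:", max_error)
```

Each weight is stored as a small integer plus one shared scale per block, so the rounding error per weight is bounded by half the scale. That is the accuracy cost paid for the smaller size.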
+## Step-by-Step Guide
+### 1. Prerequisites
+```bash
+# System updates
+sudo apt update && sudo apt upgrade -y
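+# Dependencies
+sudo apt install -y build-essential cmake git python3-pip
+# Clone and build llama.cpp (the Makefile build is deprecated in favor of CMake)
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+cmake -B build
+cmake --build build --config Release -j 4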
+```
+### 2. Using Quantized Models from Hugging Face
+My automated quantization script produces models in this format:
+```
+https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
+```
+Download your quantized model directly:
+```bash
+wget https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
+```
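The URL pattern above can also be built programmatically. The sketch below assumes that every file follows the `<model>-<quant>.gguf` naming seen in the links on this page. The helpers `gguf_url` and `download_gguf` are hypothetical, not part of any library.

```python
import os
from urllib.request import urlretrieve

def gguf_url(repo_id: str, model_name: str, quant: str) -> str:
    """Build a direct-download URL, assuming <model>-<quant>.gguf file naming."""
    filename = f"{model_name}-{quant.lower()}.gguf"
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"

def download_gguf(repo_id: str, model_name: str, quant: str, dest_dir: str = ".") -> str:
    """Download the file into dest_dir and return the local path."""
    url = gguf_url(repo_id, model_name, quant)
    dest = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
    urlretrieve(url, dest)  # large download: several GB for this model
    return dest

url = gguf_url("matrixportal/gemma-2-9b-it-GGUF", "gemma-2-9b-it", "Q4_K_M")
print(url)
```

Calling `download_gguf(...)` with the same arguments would fetch the file, equivalent to the `wget` command.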
+### 3. Running the Quantized Model
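+Basic usage (`main` is now named `llama-cli`; a CMake build places it in `build/bin/`, so adjust the path if your build differs):
+```bash
+./build/bin/llama-cli -m gemma-2-9b-it-q4_k_m.gguf -p "Your prompt here" -n 128
+```
+Example with a creative writing prompt. Gemma 2 marks turns with `<start_of_turn>` and `<end_of_turn>` rather than `[INST]`, and `-e` expands the `\n` escapes. Recent `llama-cli` builds may apply the chat template themselves in conversation mode; in that case pass only the instruction text:
+```bash
+./build/bin/llama-cli -m gemma-2-9b-it-q4_k_m.gguf -e \
+       -p "<start_of_turn>user\nWrite a short poem about AI quantization in the style of Shakespeare<end_of_turn>\n<start_of_turn>model\n" \
+       -n 256 -c 2048 -t 8 --temp 0.7
+```
+Advanced sampling parameters:
+```bash
+./build/bin/llama-cli -m gemma-2-9b-it-q4_k_m.gguf -e \
+       -p "Question: What is the GGUF format?\nAnswer:" \
+       -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9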
+```
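The sampling flags in the last command interact in a specific way. The sketch below illustrates what temperature, top-k, and top-p each do to a list of logits. It is a simplified illustration: llama.cpp's actual sampler chain applies its steps in a different order and includes further samplers. The function `sample_token` and the example logits are hypothetical.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Pick a token index from raw logits using temperature, top-k, then top-p."""
    if temperature <= 0:
        # Greedy decoding: always return the most likely token
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature scaling followed by a numerically stable softmax
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Top-k: keep only the k most probable tokens
    probs.sort(key=lambda p: p[1], reverse=True)
    probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for idx, p in probs:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Draw from the surviving tokens; choices() normalizes the weights
    indices = [i for i, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(indices, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(sample_token(logits, temperature=0))         # greedy pick
print(sample_token(logits, rng=random.Random(0)))  # seeded random pick
```

Lower `--temp` sharpens the distribution, while smaller `--top-k` and `--top-p` values shrink the set of candidate tokens before the draw.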
+### 4. Python Integration
+Install the Python package:
+```bash
+pip install llama-cpp-python
+```
+Example script:
+```python
+from llama_cpp import Llama
+# Initialize the model
+llm = Llama(
+    model_path="gemma-2-9b-it-q4_k_m.gguf",
+    n_ctx=2048,
+    n_threads=8
+)
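+# Run inference; create_chat_completion formats the messages with the model's
+# chat template (Gemma 2 does not use [INST] markers)
+response = llm.create_chat_completion(
+    messages=[{"role": "user", "content": "Explain GGUF quantization to a beginner"}],
+    max_tokens=256,
+    temperature=0.7,
+    top_p=0.9
+)
+print(response["choices"][0]["message"]["content"])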
+```
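When a raw prompt string is passed to the model instead of structured chat messages, the string has to follow the model's own chat template. As a minimal sketch, assuming Gemma's published turn format (`<start_of_turn>`, a role, the text, `<end_of_turn>`), a prompt could be assembled like this. The helper `format_gemma_prompt` is hypothetical, and the `<bos>` token is left for the tokenizer to add.

```python
def format_gemma_prompt(messages):
    """Render chat messages with Gemma's turn markers, ending with an open model turn."""
    parts = []
    for message in messages:
        # Gemma names the assistant role "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    # Leave the final model turn open so generation continues from here
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

prompt = format_gemma_prompt(
    [{"role": "user", "content": "Explain GGUF quantization to a beginner"}]
)
print(prompt)
```

Gemma's template has no separate system role, so any system-style instructions would need to go into the first user turn.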
+## Performance Tips
+1. **Hardware Utilization**:
+   - Set thread count with `-t` (typically the number of physical CPU cores)
+   - Build with a GPU backend (e.g., CUDA, Metal, or Vulkan) and offload layers with `-ngl`
+2. **Memory Optimization**:
+   - Lower-bit quantization (like q4_k_m) uses less RAM
+   - Adjust context size with the `-c` parameter; a larger context needs more memory
+3. **Speed/Accuracy Balance**:
+   - Higher-bit quantization (like q8_0) is larger and slower but more accurate
+   - Use `--temp 0` for greedy decoding when you need consistent results
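The memory tip can be checked with back-of-the-envelope arithmetic: file size is roughly parameter count times bits per weight, divided by 8. The sketch below uses the bits-per-weight values that follow from the simple block layouts (`q4_0`, `q5_0`, `q8_0`). K-quant variants such as `q4_k_m` mix block types, so their averages depend on the model and are left out. The 9 billion parameter count is a nominal assumption, and the helper name is hypothetical.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# Bits per weight including the per-block scale (32 weights per block)
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}

N_PARAMS = 9e9  # nominal parameter count, an assumption for illustration

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimate_size_gb(N_PARAMS, bpw):.1f} GB")
```

Actual RAM use is higher than the file size because the context cache and runtime buffers are allocated on top of the weights.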
+## FAQ
+**Q: What quantization levels are available?**
+A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, and q8_0, listed roughly in order of increasing size and accuracy.
+**Q: How much performance loss occurs with q4_k_m?**
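+A: Usually a small quality loss that depends on the model and task, in exchange for a file roughly 3x smaller than F16 (about 5 bits per weight on average instead of 16).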
+**Q: How to enable GPU support?**
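+A: For NVIDIA GPUs, configure the build with `cmake -B build -DGGML_CUDA=ON` (older versions used `make LLAMA_CUBLAS=1`), then offload layers at runtime with `-ngl`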
+## Useful Resources
+1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
+2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
+3. [Hugging Face Model Hub](https://huggingface.co/models)