Upload README.md with huggingface_hub

README.md +0 -123

@@ -43,126 +43,3 @@ Refer to the [original model card](https://huggingface.co/google/gemma-2-9b-it)
 | [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |
 💡 **Tip:** Use `F16` for maximum precision when quality is critical
-# GGUF Model Quantization & Usage Guide with llama.cpp
-## What is GGUF and Quantization?
-**GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
-- Supports multiple quantization levels
-- Works cross-platform
-- Enables fast loading and inference
-**Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 32-bit floats), usually at a small cost in accuracy, to:
-- Reduce model size
-- Decrease memory usage
-- Speed up inference
-## Step-by-Step Guide
-### 1. Prerequisites
-```bash
-# System updates
-sudo apt update && sudo apt upgrade -y
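-# Dependencies
-sudo apt install -y build-essential cmake git python3-pip
-# Clone and build llama.cpp (current releases build with CMake)
-git clone https://github.com/ggerganov/llama.cpp
-cd llama.cpp
-cmake -B build
-cmake --build build --config Release -j4
-# Binaries such as llama-cli are written to build/bin/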
-```
-### 2. Using Quantized Models from Hugging Face
-My automated quantization script publishes model files at URLs of this form:
-```
-https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
-```
-Download your quantized model directly:
-```bash
-wget https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
-```
-### 3. Running the Quantized Model
-Basic usage:
-```bash
-# llama-cli replaces the older ./main binary; this path assumes a CMake build
-./build/bin/llama-cli -m gemma-2-9b-it-q4_k_m.gguf -p "Your prompt here" -n 128
-```
-Example with a creative writing prompt:
-```bash
-./build/bin/llama-cli -m gemma-2-9b-it-q4_k_m.gguf \
-  -e -p "<start_of_turn>user\nWrite a short poem about AI quantization in the style of Shakespeare<end_of_turn>\n<start_of_turn>model\n" \
-  -n 256 -c 2048 -t 8 --temp 0.7
-```
-Advanced parameters:
-```bash
-./build/bin/llama-cli -m gemma-2-9b-it-q4_k_m.gguf \
-  -p "Question: What is the GGUF format?
-Answer:" \
-  -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
-```
-### 4. Python Integration
-Install the Python package:
-```bash
-pip install llama-cpp-python
-```
-Example script:
-```python
-from llama_cpp import Llama
-# Initialize the model
-llm = Llama(
-    model_path="gemma-2-9b-it-q4_k_m.gguf",
-    n_ctx=2048,
-    n_threads=8
-)
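-# Run inference (Gemma models expect <start_of_turn> markers, not [INST])
-response = llm(
-    "<start_of_turn>user\nExplain GGUF quantization to a beginner<end_of_turn>\n<start_of_turn>model\n",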
-    max_tokens=256,
-    temperature=0.7,
-    top_p=0.9
-)
-print(response["choices"][0]["text"])
-```
-## Performance Tips
-1. **Hardware Utilization**:
-   - Set the thread count with `-t` (typically the number of physical CPU cores)
-   - Build with a GPU backend such as CUDA and offload layers with `-ngl` for GPU acceleration
-2. **Memory Optimization**:
-   - Lower-bit quantization (like q4_k_m) uses less RAM
-   - Reduce the context size with the `-c` parameter to lower memory use
-3. **Speed/Accuracy Balance**:
-   - Higher-bit quantization is larger and usually slower, but more accurate
-   - Use `--temp 0` for deterministic, repeatable output
-## FAQ
-**Q: What quantization levels are available?**
-A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, q8_0
-**Q: How much performance loss occurs with q4_k_m?**
-A: Typically a small quality loss that varies by model and task, in exchange for a file roughly 3x smaller than F16
-**Q: How do I enable GPU support?**
-A: Configure the build with `cmake -B build -DGGML_CUDA=ON` for NVIDIA GPUs (older releases used `make LLAMA_CUBLAS=1`)
-## Useful Resources
-1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
-2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
-3. [Hugging Face Model Hub](https://huggingface.co/models)
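
For the Python integration step in the removed guide, it may help to build prompts with a small helper rather than hand-writing the turn markers. The sketch below assumes Gemma's instruction format (`<start_of_turn>` / `<end_of_turn>` markers, with the tokenizer adding the BOS token). The function name `build_gemma_prompt` is hypothetical and not part of `llama-cpp-python`.

```python
def build_gemma_prompt(user_message: str) -> str:
    """Wrap a single user message in Gemma-style turn markers.

    Assumption: single-turn chat, no system prompt; the BOS token is
    left to the tokenizer rather than being added here.
    """
    return (
        "<start_of_turn>user\n"
        f"{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


if __name__ == "__main__":
    # The resulting string can be passed as the prompt to Llama(...)(prompt, ...)
    print(build_gemma_prompt("Explain GGUF quantization to a beginner"))
```

The returned string ends with an open `model` turn, so generation continues as the assistant's reply.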

