matrixportal committed
Commit 30e54d7 · verified · 1 Parent(s): 1c2d107

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +123 -0
README.md CHANGED
@@ -43,3 +43,126 @@ Refer to the [original model card](https://huggingface.co/google/gemma-2-9b-it)
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |

💡 **Tip:** Use `F16` for maximum precision when quality is critical

# GGUF Model Quantization & Usage Guide with llama.cpp

## What is GGUF and Quantization?

**GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
- Supports multiple quantization levels
- Works across platforms
- Enables fast loading and inference

**Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 32-bit floats) to:
- Reduce model size
- Decrease memory usage
- Speed up inference
- All at the cost of a small accuracy trade-off
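
To get a feel for what those savings mean for a 9B-parameter model, here is a rough back-of-envelope sketch. The parameter count and effective bits-per-weight figures below are approximations for illustration, not values measured from this repository's files:

```python
# Rough size estimate for a ~9B-parameter model at different quantization levels.
PARAMS = 9.24e9  # approximate parameter count of Gemma 2 9B

effective_bits_per_weight = {
    "F16": 16.0,
    "Q8_0": 8.5,    # quantized formats also store scale data,
    "Q5_K_M": 5.7,  # so effective bits sit a bit above the nominal value
    "Q4_K_M": 4.8,
}

for name, bits in effective_bits_per_weight.items():
    size_gb = PARAMS * bits / 8 / 1e9
    print(f"{name:7s} ~ {size_gb:5.1f} GB")
```

The actual GGUF files are slightly larger because of metadata and a few tensors kept at higher precision.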

## Step-by-Step Guide

### 1. Prerequisites

```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4
```

### 2. Using Quantized Models from Hugging Face

My automated quantization pipeline publishes models at URLs of this form:
```
https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
```

Download your quantized model directly:

```bash
wget https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
```
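
If you prefer to stay in Python, the `huggingface_hub` client can fetch the same file into the local cache. A minimal sketch, assuming `pip install huggingface_hub` and using the repository and filename from this guide:

```python
from huggingface_hub import hf_hub_download

# Downloads (or reuses) the file in the local Hugging Face cache
# and returns the resolved local path.
model_path = hf_hub_download(
    repo_id="matrixportal/gemma-2-9b-it-GGUF",
    filename="gemma-2-9b-it-q4_k_m.gguf",
)
print(model_path)
```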

### 3. Running the Quantized Model

Basic usage (on recent llama.cpp builds the CLI binary is named `llama-cli` rather than `main`):
```bash
./main -m gemma-2-9b-it-q4_k_m.gguf -p "Your prompt here" -n 128
```

Example with a creative writing prompt. Gemma 2 instruction-tuned models expect the `<start_of_turn>` chat format rather than `[INST]` tags, and the `-e` flag tells llama.cpp to interpret the `\n` escapes:
```bash
./main -m gemma-2-9b-it-q4_k_m.gguf -e -p "<start_of_turn>user\nWrite a short poem about AI quantization in the style of Shakespeare<end_of_turn>\n<start_of_turn>model\n" -n 256 -c 2048 -t 8 --temp 0.7
```

Advanced parameters:
```bash
./main -m gemma-2-9b-it-q4_k_m.gguf -e -p "Question: What is the GGUF format?\nAnswer:" -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```
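
When scripting these calls, it helps to build the Gemma 2 prompt programmatically. A small sketch; the turn markers follow Gemma 2's chat format, and the helper name is purely illustrative:

```python
def build_gemma2_prompt(messages):
    """Render [{'role': 'user'|'model', 'content': ...}, ...] into
    Gemma 2's <start_of_turn> chat format."""
    parts = []
    for m in messages:
        parts.append(f"<start_of_turn>{m['role']}\n{m['content']}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model to answer next
    return "".join(parts)

prompt = build_gemma2_prompt([
    {"role": "user", "content": "What is the GGUF format?"},
])
print(prompt)  # pass this string to llama.cpp with -p (quote it carefully)
```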

### 4. Python Integration

Install the Python package:
```bash
pip install llama-cpp-python
```

Example script:
```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="gemma-2-9b-it-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8
)

# Run inference using Gemma 2's chat format
response = llm(
    "<start_of_turn>user\nExplain GGUF quantization to a beginner<end_of_turn>\n<start_of_turn>model\n",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```
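
Alternatively, llama-cpp-python can apply the chat template stored inside the GGUF file for you, so you do not need to format the turns by hand. A minimal sketch using its OpenAI-style chat API:

```python
from llama_cpp import Llama

llm = Llama(model_path="gemma-2-9b-it-q4_k_m.gguf", n_ctx=2048, n_threads=8)

# create_chat_completion applies the model's built-in chat template.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization to a beginner"}],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```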

## Performance Tips

1. **Hardware Utilization**:
   - Set the thread count with `-t` (typically your physical CPU core count)
   - Compile with CUDA/OpenCL for GPU support (see the sketch after this list)

2. **Memory Optimization**:
   - Lower-bit quantizations (like q4_k_m) use less RAM
   - Adjust the context size with the `-c` parameter

3. **Speed/Accuracy Balance**:
   - Higher-bit quantizations are slower but more accurate
   - Use `--temp 0` for deterministic, repeatable output
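
As an illustration of GPU offload on the Python side: the sketch below assumes llama-cpp-python was installed with a CUDA-enabled build; `n_gpu_layers=-1` asks it to offload every layer it can, and the actual speedup depends on your GPU and VRAM:

```python
from llama_cpp import Llama

# Offload as many layers as possible to the GPU; the rest stay on the CPU.
llm = Llama(
    model_path="gemma-2-9b-it-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=-1,  # -1 = offload all layers (requires a GPU-enabled build)
)

print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```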

## FAQ

**Q: What quantization levels are available?**
A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, and q8_0 (see the sketch after this section for listing what a repository actually provides)

**Q: How much quality loss occurs with q4_k_m?**
A: Typically a 2-5% accuracy reduction, in exchange for a roughly 4x smaller file

**Q: How do I enable GPU support?**
A: Build llama.cpp with CUDA enabled for NVIDIA GPUs, e.g. `make LLAMA_CUBLAS=1` on older releases (newer releases use `GGML_CUDA=1`, or `-DGGML_CUDA=ON` with CMake)
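
To see exactly which quantization levels a repository ships, you can list its files from Python. A small sketch using `huggingface_hub` (assumed to be installed) against this repository:

```python
from huggingface_hub import list_repo_files

# List every file in the GGUF repository and keep only the .gguf variants.
files = list_repo_files("matrixportal/gemma-2-9b-it-GGUF")
for name in sorted(f for f in files if f.endswith(".gguf")):
    print(name)
```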

## Useful Resources

1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
3. [Hugging Face Model Hub](https://huggingface.co/models)