matrixportal committed
Commit eedeb29 · verified · 1 Parent(s): 4019df9

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +0 -123
README.md CHANGED
@@ -43,126 +43,3 @@ Refer to the [original model card](https://huggingface.co/google/gemma-2-9b-it)
| [Download](https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |

💡 **Tip:** Use `F16` for maximum precision when quality is critical

# GGUF Model Quantization & Usage Guide with llama.cpp

## What is GGUF and Quantization?

**GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
- Supports multiple quantization levels
- Works cross-platform
- Enables fast loading and inference

**Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 32-bit floats) to:
- Reduce model size
- Decrease memory usage
- Speed up inference
- (With minor accuracy trade-offs)

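For context, this is roughly how GGUF files like the ones in this repo are produced with llama.cpp's own tooling. A minimal sketch, assuming the original Hugging Face checkpoint is already downloaded locally; script and binary names vary by llama.cpp version (older checkouts use `convert.py` and `./quantize`, newer ones `convert_hf_to_gguf.py` and `./llama-quantize`):

```bash
# Convert the original HF checkpoint (illustrative local path) to a full-precision GGUF
python3 convert_hf_to_gguf.py ./gemma-2-9b-it --outtype f16 --outfile gemma-2-9b-it-f16.gguf

# Quantize the F16 file down to 4-bit (Q4_K_M)
./llama-quantize gemma-2-9b-it-f16.gguf gemma-2-9b-it-q4_k_m.gguf Q4_K_M
```
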

## Step-by-Step Guide

### 1. Prerequisites

```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4
```
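
Recent llama.cpp releases have moved from the plain Makefile to CMake, so `make -j4` may warn or fail on a current checkout. A CMake build along these lines should work instead (note that newer binaries are prefixed with `llama-`, e.g. `llama-cli` rather than `main`):

```bash
# CMake-based build (current recommended path)
cmake -B build
cmake --build build --config Release -j 4
# Binaries land in build/bin/, e.g. build/bin/llama-cli
```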

### 2. Using Quantized Models from Hugging Face

My automated quantization script publishes models at URLs in this format:
```
https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
```

Download your quantized model directly:

```bash
wget https://huggingface.co/matrixportal/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-q4_k_m.gguf
```
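
If you prefer the Hugging Face tooling over `wget` (it caches and can resume downloads), the `huggingface_hub` CLI can fetch the same file; a small sketch, assuming a recent `huggingface_hub`:

```bash
pip install -U "huggingface_hub[cli]"

# Download a single GGUF file from the repo into the current directory
huggingface-cli download matrixportal/gemma-2-9b-it-GGUF \
    gemma-2-9b-it-q4_k_m.gguf --local-dir .
```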

### 3. Running the Quantized Model

Basic usage:
```bash
./main -m gemma-2-9b-it-q4_k_m.gguf -p "Your prompt here" -n 128
```

Example with a creative writing prompt (Gemma 2 uses its own chat-turn markers rather than `[INST]` tags):
```bash
./main -m gemma-2-9b-it-q4_k_m.gguf -p "<start_of_turn>user
Write a short poem about AI quantization in the style of Shakespeare<end_of_turn>
<start_of_turn>model
" -n 256 -c 2048 -t 8 --temp 0.7
```

Advanced parameters:
```bash
./main -m gemma-2-9b-it-q4_k_m.gguf -p "Question: What is the GGUF format?
Answer:" -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```
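
For longer back-and-forth sessions, it can be easier to run the model in interactive mode instead of passing a single prompt. A minimal sketch using the classic `main` flags (names may differ slightly in newer `llama-cli` builds):

```bash
# Start an interactive session; type prompts at the cursor
./main -m gemma-2-9b-it-q4_k_m.gguf -i --color -c 2048 -t 8 --temp 0.7
```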

### 4. Python Integration

Install the Python package:
```bash
pip install llama-cpp-python
```

Example script:
```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="gemma-2-9b-it-q4_k_m.gguf",
    n_ctx=2048,      # context window size
    n_threads=8      # CPU threads to use
)

# Run inference (prompt wrapped in Gemma's chat-turn format)
response = llm(
    "<start_of_turn>user\nExplain GGUF quantization to a beginner<end_of_turn>\n<start_of_turn>model\n",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```
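
To let the Python bindings use an NVIDIA GPU, `llama-cpp-python` has to be compiled with the CUDA backend. A hedged sketch (the CMake flag depends on the version: newer releases use `GGML_CUDA`, older ones `LLAMA_CUBLAS`):

```bash
# Reinstall llama-cpp-python with CUDA support enabled (newer versions)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```

Afterwards, pass `n_gpu_layers=-1` to `Llama(...)` to offload all layers to the GPU.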

## Performance Tips

1. **Hardware Utilization**:
   - Set the thread count with `-t` (typically your CPU core count)
   - Compile with CUDA/OpenCL support to offload work to the GPU

2. **Memory Optimization**:
   - Lower-bit quantizations (like q4_k_m) use less RAM
   - Adjust the context size with the `-c` parameter

3. **Speed/Accuracy Balance**:
   - Higher-bit quantizations are slower but more accurate
   - Reduce randomness with `--temp 0` for reproducible results (see the combined example below)

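Putting these tips together, a typical CPU-only invocation might look like the following (the thread count and context size are illustrative, so match them to your hardware):

```bash
# 8 threads, 4096-token context, deterministic sampling
./main -m gemma-2-9b-it-q4_k_m.gguf \
    -t 8 -c 4096 --temp 0 \
    -p "Summarize the GGUF format in two sentences." -n 128
```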

## FAQ

**Q: What quantization levels are available?**
A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, and q8_0.

**Q: How much performance loss occurs with q4_k_m?**
A: Typically a 2-5% accuracy reduction, in exchange for a file roughly 4x smaller than F16.

**Q: How do I enable GPU support?**
A: Build with `make LLAMA_CUBLAS=1` for NVIDIA GPUs (older Makefile builds); recent CMake-based builds use `cmake -B build -DGGML_CUDA=ON` instead.

## Useful Resources

1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
3. [Hugging Face Model Hub](https://huggingface.co/models)
 