ubergarm committed
Commit 9193a75 · 1 Parent(s): a2dd48f

Add IQ1_S notes and prep for upload

Files changed (1)
  1. README.md +64 -5
README.md CHANGED
@@ -167,18 +167,73 @@ custom=$(
  </details>

  #### * `IQ1_S` 132.915 GiB (1.699 BPW)
- Special mix `IQ1_S` `ffn_(gate|up)_exps` and `IQ1_M` `ffn_down_exps` routed experts. Mostly `iq4_ks/iq3_ks` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".
+ Not recommended. "For the desperate." If you can fit a larger model in RAM+VRAM, choose it instead: it may even run faster and will definitely have better perplexity (i.e. likely better quality).

- WIP
+ Special mix: `IQ1_S` `ffn_(gate|up)_exps` and `IQ1_M` `ffn_down_exps` routed experts. Mostly `iq4_ks`/`iq3_ks` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".

- TODO Perplexity
+ Final estimate: PPL = 4.9878 +/- 0.02999

  <details>

  <summary>👈 Secret Recipe</summary>

  ```bash
- echo TODO
+ #!/usr/bin/env bash
+
+ custom="
+ # First 3 dense layers (0-3) (GPU)
+ # Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+ blk\.[0-2]\.attn_k_b.*=q4_0
+ blk\.[0-2]\.attn_.*=iq4_ks
+ blk\.[0-2]\.ffn_down.*=iq4_ks
+ blk\.[0-2]\.ffn_(gate|up).*=iq3_ks
+ blk\.[0-2]\..*=iq4_ks
+
+ # All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
+ # Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+ blk\.[3-9]\.attn_k_b.*=q4_0
+ blk\.[1-5][0-9]\.attn_k_b.*=q4_0
+ blk\.60\.attn_k_b.*=q4_0
+
+ blk\.[3-9]\.attn_.*=iq4_ks
+ blk\.[1-5][0-9]\.attn_.*=iq4_ks
+ blk\.60\.attn_.*=iq4_ks
+
+ # Shared Expert (3-60) (GPU)
+ blk\.[3-9]\.ffn_down_shexp\.weight=iq4_ks
+ blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_ks
+ blk\.60\.ffn_down_shexp\.weight=iq4_ks
+
+ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
+ blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
+ blk\.60\.ffn_(gate|up)_shexp\.weight=iq3_ks
+
+ # Routed Experts (3-60) (CPU)
+ blk\.[3-9]\.ffn_down_exps\.weight=iq1_m
+ blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq1_m
+ blk\.60\.ffn_down_exps\.weight=iq1_m
+
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq1_s
+ blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq1_s
+ blk\.60\.ffn_(gate|up)_exps\.weight=iq1_s
+
+ # Token embedding and output tensors (GPU)
+ token_embd\.weight=iq4_k
+ output\.weight=iq5_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
+     /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
+     /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ1_S.gguf \
+     IQ1_S \
+     24
  ```

  </details>
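As a quick sanity check, the `grep`/`sed` step in the recipe above just drops the comment lines and joins the remaining `regex=type` rules into the single comma-separated string that `--custom-q` expects. A minimal sketch with a shortened rule list (the rules here are only examples pulled from the recipe):

```bash
#!/usr/bin/env bash
# Sketch only: replays the comment-stripping/joining step from the recipe
# above on a shortened rule list so you can eyeball the result first.
custom="
# comment lines like this one get dropped
blk\.[0-2]\.attn_k_b.*=q4_0
blk\.[0-2]\.attn_.*=iq4_ks
token_embd\.weight=iq4_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

# Prints the single comma-separated string handed to --custom-q:
# blk\.[0-2]\.attn_k_b.*=q4_0,blk\.[0-2]\.attn_.*=iq4_ks,token_embd\.weight=iq4_k
echo "$custom"
```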
 
@@ -213,7 +268,7 @@ cmake --build ./build --config Release -j $(nproc)
  ```
  Adjust `--threads` to match your number of physical cores. Refer to the discussions on my other models for multi-NUMA, dual-socket, and `--threads`/`--threads-batch` tuning on larger server rigs.

- If you OOM on VRAM, remove the additional `-ot "...=CUDA0"` or you can increase offload layers if you have more VRAM onto multi-GPU targets e.g. CUDA1 etc.
+ If you OOM on VRAM, remove the additional `-ot "...=CUDA0"`; or, if you have more VRAM, offload more layers onto multi-GPU targets, e.g. `-ot "blk\.(5|6)\.ffn_.*=CUDA1" \`.

  Test out `-rtr` to run-time-repack tensors to their `_r4` variants for layers running on CPU/RAM; this is likely faster at the default ubatch sizes. Note this disables mmap(), so you will need enough RAM to malloc all the non-offloaded weights on startup.
 
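To illustrate the multi-GPU note above, here is a sketch of how the tensor overrides might be laid out across two CUDA devices. It is only a hypothetical layout, not a tested configuration: the layer numbers, context size, thread count, and the `-fa`/`-fmoe`/`-ngl` choices are assumptions to adapt from whatever working command you already have for this quant.

```bash
# Sketch only: hypothetical 2x GPU layout.
# Specific GPU overrides are listed before the exps=CPU catch-all (assumed to
# take precedence in the order given); -rtr repacks whatever stays on CPU/RAM
# and disables mmap(), as noted above.
./build/bin/llama-server \
    --model DeepSeek-TNG-R1T2-Chimera-IQ1_S.gguf \
    --ctx-size 32768 \
    -fa -fmoe \
    -ngl 99 \
    -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
    -ot "blk\.(5|6)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    -rtr \
    --threads 24
```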
 
@@ -221,6 +276,10 @@ Generally `-ub 2048 -b 2048` or `-ub 4096 -b 4096` can give *much* faster PP spe

  Use `llama-sweep-bench --warmup-batch ...` to benchmark various configurations on your hardware and report the results to the community!

+ ## TODO
+ - [ ] Given that `IQ1_S_R4` is not symmetric with `IQ1_S`, it doesn't work with `-rtr`, so I might look into releasing an `_R4` variant after some `llama-sweep-bench` testing.
+ - [ ] Consider a slightly larger model? (gotta free up some disk space lol)
+
  ## References
  * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
  * [Larger ik quants available here: Kebob/DeepSeek-TNG-R1T2-Chimera-IK_GGUF](https://huggingface.co/Kebob/DeepSeek-TNG-R1T2-Chimera-IK_GGUF)
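To go with the `llama-sweep-bench` note above, a minimal benchmarking sketch. The model path, batch sizes, and thread count are placeholders, and the offload flags should mirror whatever you settled on for `llama-server`; rerun with e.g. `-ub 2048 -b 2048` vs `-ub 4096 -b 4096` and compare PP/TG speeds.

```bash
# Sketch only: sweep prompt-processing and token-generation speeds across the
# context window at one batch-size setting, then repeat with other settings.
./build/bin/llama-sweep-bench \
    --model DeepSeek-TNG-R1T2-Chimera-IQ1_S.gguf \
    --ctx-size 8192 \
    -ub 4096 -b 4096 \
    -fa -fmoe \
    -ngl 99 \
    -ot exps=CPU \
    --threads 24 \
    --warmup-batch
```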