ubergarm committed
Commit 9193a75 · 1 Parent(s): a2dd48f

Add IQ1_S notes and prep for upload

Files changed (1)
  1. README.md +64 -5
README.md CHANGED
@@ -167,18 +167,73 @@ custom=$(
  </details>

  #### * `IQ1_S` 132.915 GiB (1.699 BPW)
- Special mix `IQ1_S` `ffn_(gate|up)_exps` and `IQ1_M` `ffn_down_exps` routed experts. Mostly `iq4_ks/iq3_ks` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".
+ Not recommended. "For the desperate." If you can fit a larger model in RAM+VRAM, choose it instead: it may even run faster and will definitely have better perplexity (i.e. likely better quality).

- WIP
+ Special mix: `IQ1_S` `ffn_(gate|up)_exps` and `IQ1_M` `ffn_down_exps` routed experts. Mostly `iq4_ks`/`iq3_ks` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".

- TODO Perplexity
+ Final estimate: PPL = 4.9878 +/- 0.02999

  <details>

  <summary>👈 Secret Recipe</summary>

  ```bash
- echo TODO
+ #!/usr/bin/env bash
+
+ custom="
+ # First 3 dense layers (0-3) (GPU)
+ # Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+ blk\.[0-2]\.attn_k_b.*=q4_0
+ blk\.[0-2]\.attn_.*=iq4_ks
+ blk\.[0-2]\.ffn_down.*=iq4_ks
+ blk\.[0-2]\.ffn_(gate|up).*=iq3_ks
+ blk\.[0-2]\..*=iq4_ks
+
+ # All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
+ # Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+ blk\.[3-9]\.attn_k_b.*=q4_0
+ blk\.[1-5][0-9]\.attn_k_b.*=q4_0
+ blk\.60\.attn_k_b.*=q4_0
+
+ blk\.[3-9]\.attn_.*=iq4_ks
+ blk\.[1-5][0-9]\.attn_.*=iq4_ks
+ blk\.60\.attn_.*=iq4_ks
+
+ # Shared Expert (3-60) (GPU)
+ blk\.[3-9]\.ffn_down_shexp\.weight=iq4_ks
+ blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_ks
+ blk\.60\.ffn_down_shexp\.weight=iq4_ks
+
+ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
+ blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
+ blk\.60\.ffn_(gate|up)_shexp\.weight=iq3_ks
+
+ # Routed Experts (3-60) (CPU)
+ blk\.[3-9]\.ffn_down_exps\.weight=iq1_m
+ blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq1_m
+ blk\.60\.ffn_down_exps\.weight=iq1_m
+
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq1_s
+ blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq1_s
+ blk\.60\.ffn_(gate|up)_exps\.weight=iq1_s
+
+ # Token embedding and output tensors (GPU)
+ token_embd\.weight=iq4_k
+ output\.weight=iq5_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
+     /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
+     /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ1_S.gguf \
+     IQ1_S \
+     24
  ```

  </details>
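As a quick sanity check, the `grep`/`sed` step in the recipe above just drops the comment lines and joins the remaining `regex=type` rules into the single comma-separated string that `--custom-q` expects. A minimal sketch with a shortened rule list (the rules here are only examples pulled from the recipe):

```bash
#!/usr/bin/env bash
# Sketch only: replays the comment-stripping/joining step from the recipe
# above on a shortened rule list so you can eyeball the result first.
custom="
# comment lines like this one get dropped
blk\.[0-2]\.attn_k_b.*=q4_0
blk\.[0-2]\.attn_.*=iq4_ks
token_embd\.weight=iq4_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

# Prints the single comma-separated string handed to --custom-q:
# blk\.[0-2]\.attn_k_b.*=q4_0,blk\.[0-2]\.attn_.*=iq4_ks,token_embd\.weight=iq4_k
echo "$custom"
```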
 
@@ -213,7 +268,7 @@ cmake --build ./build --config Release -j $(nproc)
  ```
  Adjust `--threads` to match your number of physical cores. Refer to the discussions on my other models for multi-NUMA, dual-socket, and `--threads`/`--threads-batch` tuning on larger server rigs.

- If you OOM on VRAM, remove the additional `-ot "...=CUDA0"` or you can increase offload layers if you have more VRAM onto multi-GPU targets e.g. CUDA1 etc.
+ If you OOM on VRAM, remove the additional `-ot "...=CUDA0"`; or, if you have more VRAM, offload more layers onto multi-GPU targets, e.g. `-ot "blk\.(5|6)\.ffn_.*=CUDA1" \`.

  Test out `-rtr` to run-time-repack tensors to their `_r4` variants for layers running on CPU/RAM; this is likely faster at the default ubatch sizes. Note this disables mmap(), so you will need enough RAM to malloc all the non-offloaded weights on startup.
 
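To illustrate the multi-GPU note above, here is a sketch of how the tensor overrides might be laid out across two CUDA devices. It is only a hypothetical layout, not a tested configuration: the layer numbers, context size, thread count, and the `-fa`/`-fmoe`/`-ngl` choices are assumptions to adapt from whatever working command you already have for this quant.

```bash
# Sketch only: hypothetical 2x GPU layout.
# Specific GPU overrides are listed before the exps=CPU catch-all (assumed to
# take precedence in the order given); -rtr repacks whatever stays on CPU/RAM
# and disables mmap(), as noted above.
./build/bin/llama-server \
    --model DeepSeek-TNG-R1T2-Chimera-IQ1_S.gguf \
    --ctx-size 32768 \
    -fa -fmoe \
    -ngl 99 \
    -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
    -ot "blk\.(5|6)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    -rtr \
    --threads 24
```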
 
@@ -221,6 +276,10 @@ Generally `-ub 2048 -b 2048` or `-ub 4096 -b 4096` can give *much* faster PP spe

  Use `llama-sweep-bench --warmup-batch ...` to benchmark various configurations on your hardware and report the results to the community!

+ ## TODO
+ - [ ] Given that `IQ1_S_R4` is not symmetric with `IQ1_S`, it doesn't work with `-rtr`, so I might look into releasing an `_R4` variant after some `llama-sweep-bench` testing.
+ - [ ] Consider a slightly larger model? (gotta free up some disk space lol)
+
  ## References
  * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
  * [Larger ik quants available here: Kebob/DeepSeek-TNG-R1T2-Chimera-IK_GGUF](https://huggingface.co/Kebob/DeepSeek-TNG-R1T2-Chimera-IK_GGUF)
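To go with the `llama-sweep-bench` note above, a minimal benchmarking sketch. The model path, batch sizes, and thread count are placeholders, and the offload flags should mirror whatever you settled on for `llama-server`; rerun with e.g. `-ub 2048 -b 2048` vs `-ub 4096 -b 4096` and compare PP/TG speeds.

```bash
# Sketch only: sweep prompt-processing and token-generation speeds across the
# context window at one batch-size setting, then repeat with other settings.
./build/bin/llama-sweep-bench \
    --model DeepSeek-TNG-R1T2-Chimera-IQ1_S.gguf \
    --ctx-size 8192 \
    -ub 4096 -b 4096 \
    -fa -fmoe \
    -ngl 99 \
    -ot exps=CPU \
    --threads 24 \
    --warmup-batch
```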