quantized_by: ubergarm
pipeline_tag: text-generation
base_model: tngtech/DeepSeek-TNG-R1T2-Chimera
license: mit
base_model_relation: quantized
tags:
- mla
- imatrix
- conversational
- ik_llama.cpp
ik_llama.cpp imatrix Quantizations of DeepSeek-TNG-R1T2-Chimera
This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are also supported by the Nexesenex/croco.cpp fork of KoboldCpp.
These quants provide best-in-class perplexity for the given memory footprint.
Big Thanks
Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!
Quants
For some larger non-imatrix ik quant options, check out Kebob/DeepSeek-TNG-R1T2-Chimera-IK_GGUF.
* IQ3_KS 281.463 GiB (3.598 BPW)
Special mix with all new IQ3_KS ffn_(gate|up)_exps and IQ4_KS ffn_down_exps routed experts. Mostly iq5_ks/iq4_ks for attn and shared expert, with iq5_k token_embd and iq6_k output "head".
Final estimate: PPL = 3.3167 +/- 0.01789
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# First 3 dense layers (0-2) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[0-2]\.attn_k_b.*=q5_0
blk\.[0-2]\.attn_.*=iq5_ks
blk\.[0-2]\.ffn_down.*=iq5_ks
blk\.[0-2]\.ffn_(gate|up).*=iq4_ks
blk\.[0-2]\..*=iq5_ks
# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[3-9]\.attn_k_b.*=q5_0
blk\.[1-5][0-9]\.attn_k_b.*=q5_0
blk\.60\.attn_k_b.*=q5_0
blk\.[3-9]\.attn_.*=iq5_ks
blk\.[1-5][0-9]\.attn_.*=iq5_ks
blk\.60\.attn_.*=iq5_ks
# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.60\.ffn_down_shexp\.weight=iq5_ks
blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks
# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
blk\.60\.ffn_down_exps\.weight=iq4_ks
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.60\.ffn_(gate|up)_exps\.weight=iq3_ks
# Token embedding and output tensors (GPU)
token_embd\.weight=iq5_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ3_KS.gguf \
IQ3_KS \
24
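Note: the grep/sed pipeline above simply drops the comment lines and joins the remaining regex=type pairs into the single comma-separated string that --custom-q expects. The resulting argument looks roughly like this (truncated):
blk\.[0-2]\.attn_k_b.*=q5_0,blk\.[0-2]\.attn_.*=iq5_ks,blk\.[0-2]\.ffn_down.*=iq5_ks,...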
* IQ2_KS 203.553 GiB (2.602 BPW)
Special mix with IQ2_KS ffn_(gate|up)_exps and new IQ3_KS ffn_down_exps routed experts. Mostly iq5_ks/iq4_ks for attn and shared expert, with iq5_k token_embd and iq6_k output "head".
Final estimate: PPL = 3.6254 +/- 0.02001
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# First 3 dense layers (0-2) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[0-2]\.attn_k_b.*=q5_0
blk\.[0-2]\.attn_.*=iq5_ks
blk\.[0-2]\.ffn_down.*=iq5_ks
blk\.[0-2]\.ffn_(gate|up).*=iq4_ks
blk\.[0-2]\..*=iq5_ks
# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[3-9]\.attn_k_b.*=q5_0
blk\.[1-5][0-9]\.attn_k_b.*=q5_0
blk\.60\.attn_k_b.*=q5_0
blk\.[3-9]\.attn_.*=iq5_ks
blk\.[1-5][0-9]\.attn_.*=iq5_ks
blk\.60\.attn_.*=iq5_ks
# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.60\.ffn_down_shexp\.weight=iq5_ks
blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks
# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_ks
blk\.60\.ffn_down_exps\.weight=iq3_ks
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_ks
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_ks
# Token embedding and output tensors (GPU)
token_embd\.weight=iq5_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ2_KS.gguf \
IQ2_KS \
24
* IQ2_KT 171.146 GiB (2.188 BPW)
Designed for full offload on RTX 6000 PRO Blackwell setups with 192GB total VRAM, with (hopefully) the full 160k context and sufficiently large batch sizes. These KT quant types are quite fast on CUDA, but TG is not as fast when inferencing on CPU.
Special mix with the new trellis-quant (QTIP/EXL3 style) IQ2_KT ffn_(gate|down|up)_exps routed experts. Mostly iq4_kt/iq3_kt for attn and shared expert, with iq4_kt token_embd and iq5_k output "head".
Final estimate: PPL = 3.8887 +/- 0.02191
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# First 3 dense layers (0-2) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports block-size-32 types like qN_0 or iq4_nl
blk\.[0-2]\.attn_k_b.*=iq4_nl
blk\.[0-2]\.attn_.*=iq4_kt
blk\.[0-2]\.ffn_down.*=iq4_kt
blk\.[0-2]\.ffn_(gate|up).*=iq3_kt
blk\.[0-2]\..*=iq4_kt
# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports block-size-32 types like qN_0 or iq4_nl
blk\.[3-9]\.attn_k_b.*=iq4_nl
blk\.[1-5][0-9]\.attn_k_b.*=iq4_nl
blk\.60\.attn_k_b.*=iq4_nl
blk\.[3-9]\.attn_.*=iq4_kt
blk\.[1-5][0-9]\.attn_.*=iq4_kt
blk\.60\.attn_.*=iq4_kt
# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq4_kt
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_kt
blk\.60\.ffn_down_shexp\.weight=iq4_kt
blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq3_kt
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq3_kt
blk\.60\.ffn_(gate|up)_shexp\.weight=iq3_kt
# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq2_kt
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq2_kt
blk\.60\.ffn_down_exps\.weight=iq2_kt
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_kt
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_kt
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_kt
# Token embedding and output tensors (GPU)
token_embd\.weight=iq4_kt
output\.weight=iq5_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ2_KT.gguf \
IQ2_KT \
24
* IQ2_XXS 169.590 GiB (2.168 BPW)
Not recommended, but it should be faster and better quality than the IQ1_S and is okay with full offload on multi-GPU. It should be fine for hybrid CPU+GPU inference as well if this size suits your rig, though you probably want the IQ2_KT for full GPU offload.
Special mix with IQ2_XXS ffn_(gate|up)_exps and IQ2_KS ffn_down_exps routed experts. Mostly iq4_ks/iq3_ks for attn and shared expert, with iq4_k token_embd and iq5_k output "head".
Final estimate: PPL = 4.0078 +/- 0.02291
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# First 3 dense layers (0-2) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[0-2]\.attn_k_b.*=q4_0
blk\.[0-2]\.attn_.*=iq4_ks
blk\.[0-2]\.ffn_down.*=iq4_ks
blk\.[0-2]\.ffn_(gate|up).*=iq3_ks
blk\.[0-2]\..*=iq4_ks
# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[3-9]\.attn_k_b.*=q4_0
blk\.[1-5][0-9]\.attn_k_b.*=q4_0
blk\.60\.attn_k_b.*=q4_0
blk\.[3-9]\.attn_.*=iq4_ks
blk\.[1-5][0-9]\.attn_.*=iq4_ks
blk\.60\.attn_.*=iq4_ks
# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.60\.ffn_down_shexp\.weight=iq4_ks
blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq3_ks
# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq2_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq2_ks
blk\.60\.ffn_down_exps\.weight=iq2_ks
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_xxs
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_xxs
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_xxs
# Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq5_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ2_XXS.gguf \
IQ2_XXS \
24
* IQ1_S 132.915 GiB (1.699 BPW)
Not recommended: "for the desperate". If you can fit a larger model in RAM+VRAM, choose the larger model; it might even run faster and will definitely have better perplexity (and likely better quality).
Special mix with IQ1_S ffn_(gate|up)_exps and IQ1_M ffn_down_exps routed experts. Mostly iq4_ks/iq3_ks for attn and shared expert, with iq4_k token_embd and iq5_k output "head".
Final estimate: PPL = 4.9878 +/- 0.02999
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# First 3 dense layers (0-2) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[0-2]\.attn_k_b.*=q4_0
blk\.[0-2]\.attn_.*=iq4_ks
blk\.[0-2]\.ffn_down.*=iq4_ks
blk\.[0-2]\.ffn_(gate|up).*=iq3_ks
blk\.[0-2]\..*=iq4_ks
# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[3-9]\.attn_k_b.*=q4_0
blk\.[1-5][0-9]\.attn_k_b.*=q4_0
blk\.60\.attn_k_b.*=q4_0
blk\.[3-9]\.attn_.*=iq4_ks
blk\.[1-5][0-9]\.attn_.*=iq4_ks
blk\.60\.attn_.*=iq4_ks
# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.60\.ffn_down_shexp\.weight=iq4_ks
blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq3_ks
# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq1_m
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq1_m
blk\.60\.ffn_down_exps\.weight=iq1_m
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq1_s
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq1_s
blk\.60\.ffn_(gate|up)_exps\.weight=iq1_s
# Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq5_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
/mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ1_S.gguf \
IQ1_S \
24
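For reference, perplexity figures like the ones quoted above are typically measured with ik_llama.cpp's llama-perplexity tool. Below is a minimal sketch, assuming the usual wiki.test.raw test file and the same hybrid offload flags as the Quick Start command; the exact settings behind the quoted numbers may differ.
#!/usr/bin/env bash
# Hedged sketch only: wiki.test.raw and the offload flags are assumptions,
# not necessarily the configuration used for the PPL values in this card.
./build/bin/llama-perplexity \
    --model /models/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
    -f wiki.test.raw \
    -fa -mla 3 -fmoe -amb 512 \
    -ngl 99 \
    -ot exps=CPU \
    --threads 16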
Quick Start
## clone latest ik_llama.cpp
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
## build for hybrid CUDA and CPU DeepSeek inferencing
# apt-get install build-essential cmake ccache nvidia-cuda-toolkit # plus anything you need
cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build ./build --config Release -j $(nproc)
## Run api server
./build/bin/llama-server \
--model /models/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
--alias ubergarm/DeepSeek-TNG-R1T2-Chimera-IQ3_KS \
-fa \
-mla 3 -fmoe -amb 512 \
--ctx-size 32768 \
-ctk q8_0 \
-ngl 99 \
-ot "blk\.(3|4)\.ffn_.*=CUDA0" \
-ot exps=CPU \
-ub 1024 -b 2048 \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
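Once the server is up, a quick sanity check against its OpenAI-compatible chat endpoint (assuming the default /v1/chat/completions route) looks like:
# Simple request against the running server; the prompt is just an example.
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/DeepSeek-TNG-R1T2-Chimera-IQ3_KS",
          "messages": [{"role": "user", "content": "Hello! Briefly introduce yourself."}],
          "max_tokens": 128
        }'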
Adjust --threads to equal the number of physical CPU cores. Refer to the discussions on my other models for multi-NUMA, dual-socket, and varying --threads and --threads-batch setups on larger server rigs.
If you OOM on VRAM, remove the additional -ot "...=CUDA0" override. If you have more VRAM, you can instead offload more layers, including to additional GPUs with further overrides, e.g. -ot "blk\.(5|6)\.ffn_.*=CUDA1" (see the sketch below).
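A hypothetical two-GPU variant of the Quick Start command (a sketch only; which layers to pin depends on how much VRAM each card has), keeping the catch-all -ot exps=CPU last as in the command above:
#!/usr/bin/env bash
# Sketch: pin a few more MoE layers to a second GPU before falling back to CPU.
./build/bin/llama-server \
    --model /models/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-TNG-R1T2-Chimera-IQ3_KS \
    -fa \
    -mla 3 -fmoe -amb 512 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -ngl 99 \
    -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
    -ot "blk\.(5|6)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    -ub 1024 -b 2048 \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080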
Test out -rtr to run-time repack tensors into their _r4 variants for the layers running on CPU/RAM, which is likely faster at the default ubatch sizes. Note this disables mmap(), so you will need enough RAM to malloc all the non-offloaded weights on startup.
Generally -ub 2048 -b 2048 or -ub 4096 -b 4096 can give much faster PP speeds at the cost of some additional VRAM. Test against leaving it at the default -ub 512 -b 2048.
Use llama-sweep-bench --warmup-batch ... to benchmark various configurations on your hardware and report back to the community!
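A minimal llama-sweep-bench sketch, reusing the offload flags from the Quick Start command; swap the -ub/-b values or add -rtr between runs to compare (treat this as a starting point, not a tuned configuration):
#!/usr/bin/env bash
# Sweep PP/TG performance across the context window with the same hybrid offload.
# Optionally add -rtr to compare run-time-repacked CPU tensors (disables mmap).
./build/bin/llama-sweep-bench \
    --model /models/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
    --ctx-size 32768 \
    -ctk q8_0 \
    -fa -mla 3 -fmoe -amb 512 \
    -ngl 99 \
    -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 2048 -b 2048 \
    --warmup-batch \
    --threads 16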
TODO
- Given that IQ1_S_R4 is not symmetric with IQ1_S, it doesn't work with -rtr, so I might look into releasing an _R4 variant after some llama-sweep-bench testing.
- Consider a slightly larger model? (gotta free up some disk space lol)