metadata
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: tngtech/DeepSeek-TNG-R1T2-Chimera
license: mit
base_model_relation: quantized
tags:
  - mla
  - imatrix
  - conversational
  - ik_llama.cpp

ik_llama.cpp imatrix Quantizations of DeepSeek-TNG-R1T2-Chimera

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

Some of ik's new quants are also supported by the Nexesenex/croco.cpp fork of KoboldCpp.

These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community forums, and the YouTube channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quants

For some larger non-imatrix ik quant options, check out Kebob/DeepSeek-TNG-R1T2-Chimera-IK_GGUF.
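
Each quant below reports a Final estimate: PPL line. As a hedged sketch (not the exact command used for the published numbers, and assuming a standard wiki.test.raw run), you could sanity-check your own download like this, reusing the hybrid offload flags from the Quick Start section; paths are placeholders:

#!/usr/bin/env bash
# hedged sketch: measure perplexity of a downloaded quant over wiki.test.raw
# (paths are placeholders; offload flags copied from the Quick Start below)
./build/bin/llama-perplexity \
    --model /models/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    -fa -mla 3 -fmoe -amb 512 \
    -ngl 99 \
    -ot exps=CPU \
    --threads 16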

* IQ3_KS 281.463 GiB (3.598 BPW)

Special mix with all new IQ3_KS ffn_(gate|up)_exps and IQ4_KS ffn_down_exps routed experts. Mostly iq5_ks/iq4_ks for attn and shared expert. iq5_k token_embd and iq6_k output "head".

Final estimate: PPL = 3.3167 +/- 0.01789

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# First 3 dense layers (0-3) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[0-2]\.attn_k_b.*=q5_0
blk\.[0-2]\.attn_.*=iq5_ks
blk\.[0-2]\.ffn_down.*=iq5_ks
blk\.[0-2]\.ffn_(gate|up).*=iq4_ks
blk\.[0-2]\..*=iq5_ks

# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[3-9]\.attn_k_b.*=q5_0
blk\.[1-5][0-9]\.attn_k_b.*=q5_0
blk\.60\.attn_k_b.*=q5_0

blk\.[3-9]\.attn_.*=iq5_ks
blk\.[1-5][0-9]\.attn_.*=iq5_ks
blk\.60\.attn_.*=iq5_ks

# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.60\.ffn_down_shexp\.weight=iq5_ks

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks

# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
blk\.60\.ffn_down_exps\.weight=iq4_ks

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.60\.ffn_(gate|up)_exps\.weight=iq3_ks

# Token embedding and output tensors (GPU)
token_embd\.weight=iq5_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
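
# "$custom" is now a single comma-separated list of tensor-regex=quant-type
# pairs, which is the format llama-quantize expects for --custom-q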

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ3_KS.gguf \
    IQ3_KS \
    24

* IQ2_KS 203.553 GiB (2.602 BPW)

Special mix with IQ2_KS ffn_(gate|up)_exps and new IQ3_KS ffn_down_exps routed experts. Mostly iq5_ks/iq4_ks for attn and shared expert. iq5_k token_embd and iq6_k output "head".

Final estimate: PPL = 3.6254 +/- 0.02001

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# First 3 dense layers (0-3) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[0-2]\.attn_k_b.*=q5_0
blk\.[0-2]\.attn_.*=iq5_ks
blk\.[0-2]\.ffn_down.*=iq5_ks
blk\.[0-2]\.ffn_(gate|up).*=iq4_ks
blk\.[0-2]\..*=iq5_ks

# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[3-9]\.attn_k_b.*=q5_0
blk\.[1-5][0-9]\.attn_k_b.*=q5_0
blk\.60\.attn_k_b.*=q5_0

blk\.[3-9]\.attn_.*=iq5_ks
blk\.[1-5][0-9]\.attn_.*=iq5_ks
blk\.60\.attn_.*=iq5_ks

# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.60\.ffn_down_shexp\.weight=iq5_ks

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks

# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_ks
blk\.60\.ffn_down_exps\.weight=iq3_ks

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_ks
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_ks

# Token embedding and output tensors (GPU)
token_embd\.weight=iq5_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ2_KS.gguf \
    IQ2_KS \
    24

* IQ2_KT 171.146 GiB (2.188 BPW)

Designed for full offload on dual RTX 6000 PRO Blackwell cards (192GB total VRAM), (hopefully) with the full 160k context and sufficiently large batch sizes. These KT quant types are quite fast on CUDA but have slower TG when inferencing on CPU. A hedged full-offload launch sketch follows the recipe below.

Special mix with the new trellis quants (QTIP/EXL3 style): IQ2_KT for the ffn_(gate|down|up)_exps routed experts. Mostly iq4_kt/iq3_kt for attn and shared expert. iq4_k token_embd and iq5_k output "head".

Final estimate: PPL = 3.8887 +/- 0.02191

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# First 3 dense layers (0-3) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so it only supports smaller block types like qN_0 or iq4_nl
blk\.[0-2]\.attn_k_b.*=iq4_nl
blk\.[0-2]\.attn_.*=iq4_kt
blk\.[0-2]\.ffn_down.*=iq4_kt
blk\.[0-2]\.ffn_(gate|up).*=iq3_kt
blk\.[0-2]\..*=iq4_kt

# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so it only supports smaller block types like qN_0 or iq4_nl
blk\.[3-9]\.attn_k_b.*=iq4_nl
blk\.[1-5][0-9]\.attn_k_b.*=iq4_nl
blk\.60\.attn_k_b.*=iq4_nl

blk\.[3-9]\.attn_.*=iq4_kt
blk\.[1-5][0-9]\.attn_.*=iq4_kt
blk\.60\.attn_.*=iq4_kt

# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq4_kt
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_kt
blk\.60\.ffn_down_shexp\.weight=iq4_kt

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq3_kt
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq3_kt
blk\.60\.ffn_(gate|up)_shexp\.weight=iq3_kt

# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq2_kt
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq2_kt
blk\.60\.ffn_down_exps\.weight=iq2_kt

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_kt
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_kt
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_kt

# Token embedding and output tensors (GPU)
token_embd\.weight=iq4_kt
output\.weight=iq5_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ2_KT.gguf \
    IQ2_KT \
    24
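
A hedged sketch of a full dual-GPU offload launch for this quant; the model path, -ts 1,1 split, 163840-token (160k) context, and batch sizes are assumptions to adjust for your rig:

#!/usr/bin/env bash
# hedged sketch: full offload of the IQ2_KT across two GPUs (no exps=CPU override)
# the model path is a placeholder; point it at the first split of your download
./build/bin/llama-server \
    --model /models/DeepSeek-TNG-R1T2-Chimera-IQ2_KT.gguf \
    --alias ubergarm/DeepSeek-TNG-R1T2-Chimera-IQ2_KT \
    -fa \
    -mla 3 -fmoe -amb 512 \
    --ctx-size 163840 \
    -ctk q8_0 \
    -ngl 99 \
    -ts 1,1 \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080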

* IQ2_XXS 169.590 GiB (2.168 BPW)

Not recommended, but it should be faster and better quality than the IQ1_S, and it is fine with full offload on multi-GPU. It should also be okay for hybrid CPU+GPU inference if this size suits your rig. For full GPU offload, you probably want the IQ2_KT instead.

Special mix IQ2_XXS ffn_(gate|up)_exps and IQ2_KS ffn_down_exps routed experts. Mostly iq4_ks/iq3_ks for attn and shared expert. iq4_k token_embd and iq5_k output "head".

Final estimate: PPL = 4.0078 +/- 0.02291

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# First 3 dense layers (0-3) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[0-2]\.attn_k_b.*=q4_0
blk\.[0-2]\.attn_.*=iq4_ks
blk\.[0-2]\.ffn_down.*=iq4_ks
blk\.[0-2]\.ffn_(gate|up).*=iq3_ks
blk\.[0-2]\..*=iq4_ks

# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[3-9]\.attn_k_b.*=q4_0
blk\.[1-5][0-9]\.attn_k_b.*=q4_0
blk\.60\.attn_k_b.*=q4_0

blk\.[3-9]\.attn_.*=iq4_ks
blk\.[1-5][0-9]\.attn_.*=iq4_ks
blk\.60\.attn_.*=iq4_ks

# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.60\.ffn_down_shexp\.weight=iq4_ks

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq3_ks

# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq2_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq2_ks
blk\.60\.ffn_down_exps\.weight=iq2_ks

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_xxs
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_xxs
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_xxs

# Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq5_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ2_XXS.gguf \
    IQ2_XXS \
    24

* IQ1_S 132.915 GiB (1.699 BPW)

Not recommended. "For the desperate". If you can fit a larger model in RAM+VRAM choose a larger model as it might even run faster and will definitely have better perplexity (likely better quality).

Special mix IQ1_S ffn_(gate|up)_exps and IQ1_M ffn_down_exps routed experts. Mostly iq4_ks/iq3_ks for attn and shared expert. iq4_k token_embd and iq5_k output "head".

Final estimate: PPL = 4.9878 +/- 0.02999

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# First 3 dense layers (0-3) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[0-2]\.attn_k_b.*=q4_0
blk\.[0-2]\.attn_.*=iq4_ks
blk\.[0-2]\.ffn_down.*=iq4_ks
blk\.[0-2]\.ffn_(gate|up).*=iq3_ks
blk\.[0-2]\..*=iq4_ks

# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
blk\.[3-9]\.attn_k_b.*=q4_0
blk\.[1-5][0-9]\.attn_k_b.*=q4_0
blk\.60\.attn_k_b.*=q4_0

blk\.[3-9]\.attn_.*=iq4_ks
blk\.[1-5][0-9]\.attn_.*=iq4_ks
blk\.60\.attn_.*=iq4_ks

# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.60\.ffn_down_shexp\.weight=iq4_ks

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq3_ks

# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq1_m
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq1_m
blk\.60\.ffn_down_exps\.weight=iq1_m

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq1_s
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq1_s
blk\.60\.ffn_(gate|up)_exps\.weight=iq1_s

# Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq5_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-256x21B-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-IQ1_S.gguf \
    IQ1_S \
    24

Quick Start

## clone latest ik_llama.cpp
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp

## build for hybrid CUDA and CPU DeepSeek inferencing
# apt-get install build-essential cmake ccache nvidia-cuda-toolkit # plus anything you need
cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build ./build --config Release -j $(nproc)

## Run api server
./build/bin/llama-server \
    --model /models/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-TNG-R1T2-Chimera-IQ3_KS \
    -fa \
    -mla 3 -fmoe -amb 512 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -ngl 99 \
    -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 1024 -b 2048 \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080

Adjust --threads to equal the number of physical CPU cores. Refer to the discussions on my other models for multi-NUMA, dual-socket, and varying --threads and --threads-batch settings on larger server rigs.
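
Not sure how many physical cores you have? A hedged sketch for Linux (lscpu output formatting can vary by distro):

# count physical cores (sockets x cores-per-socket), ignoring SMT threads
sockets=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
cores=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
echo "use --threads $(( sockets * cores ))"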

If you OOM on VRAM, remove the additional -ot "...=CUDA0" override. If you have more VRAM, you can instead offload more layers, including to multi-GPU targets, e.g. -ot "blk\.(5|6)\.ffn_.*=CUDA1" \.

Test out -rtr to run-time-repack tensors into their _r4 variants for the layers running on CPU/RAM; this is likely faster at the default ubatch sizes. Note that -rtr disables mmap(), so you will need enough RAM to malloc all the non-offloaded weights at startup.
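
For example (a hedged sketch, reusing the Quick Start flags above with -rtr added and the per-layer CUDA override dropped):

# hedged sketch: same launch as the Quick Start, with run-time repacking enabled
# note: -rtr disables mmap, so the CPU-side experts must fit in RAM at startup
./build/bin/llama-server \
    --model /models/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
    -rtr \
    -fa -mla 3 -fmoe -amb 512 \
    --ctx-size 32768 -ctk q8_0 \
    -ngl 99 -ot exps=CPU \
    --threads 16 --host 127.0.0.1 --port 8080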

Generally -ub 2048 -b 2048 or -ub 4096 -b 4096 can give much faster PP speeds at the cost of some additional VRAM. Test against leaving it at the default -ub 512 -b 2048.

Use llama-sweep-bench --warmup-batch ... to benchmark various configurations on your hardware and report your results to the community!
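
For example, a hedged sketch comparing the larger batch sizes against the defaults (llama-sweep-bench shares llama-server's model/offload flags in ik_llama.cpp; paths are placeholders):

# hedged sketch: sweep PP/TG speeds across the context window with larger batches,
# then rerun with the default `-ub 512 -b 2048` to compare
./build/bin/llama-sweep-bench \
    --model /models/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
    -fa -mla 3 -fmoe -amb 512 \
    -ctk q8_0 -c 32768 \
    -ngl 99 -ot exps=CPU \
    -ub 2048 -b 2048 \
    --threads 16 \
    --warmup-batch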

TODO

  • Given that IQ1_S_R4 is not symmetric with IQ1_S, it doesn't work with -rtr, so I might look into releasing an _R4 variant after some llama-sweep-bench testing.
  • Consider a slightly larger model? (gotta free up some disk space lol)

References