ik_llama.cpp imatrix Quantizations of Hunyuan-A13B-Instruct

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

Some of ik's new quants are also supported by the Nexesenex/croco.cpp fork of KoboldCpp.

These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community forums, and the YouTube channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quants

IQ3_KS 34.088 GiB (3.642 BPW)

A special mix: IQ4_KS for the ffn_down routed experts and the all-new IQ3_KS for the ffn_(up|gate) routed experts, with iq6_k/iq5_k for attention and the shared expert as shown in the recipe below. Test out -rtr to run-time-repack tensors into their _r4 variants for the layers running on CPU/RAM; this is likely faster at the default ubatch size.

With under 16GB VRAM and ~24GB RAM you can fit 32k context and still offload 10 extra exps layers onto the GPU for extra TG speed!

It can even run on just 4GB VRAM with lower context and no extra offloaded layers, given enough system RAM (~32GiB).

With extra VRAM you can run more context or offload additional layers (see the sketch just below).
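
For example, here is a rough sketch of how you might widen the tensor-override regexes from the Quick Start command further down to keep more routed-expert layers on the GPU; the layer range shown is illustrative, not a tuned recommendation:

# Sketch only: offload the ffn tensors of layers 0-19 to CUDA0 instead of just 0-9,
# leaving the remaining routed experts on CPU. Adjust the ranges to fit your VRAM.
-ngl 99 \
-ot "blk\.([0-9])\.ffn_.*=CUDA0" \
-ot "blk\.(1[0-9])\.ffn_.*=CUDA0" \
-ot exps=CPU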

👈 Secret Recipe
custom="
# Attention
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_o.*=iq5_k

# 1x Shared Expert
blk\..*\.ffn_(down)_shexp.*=iq6_k
blk\..*\.ffn_(gate|up)_shexp.*=iq5_k

# 64x Routed Experts
blk\..*\.ffn_(down)_exps.*=iq4_ks
blk\..*\.ffn_(gate|up)_exps.*=iq3_ks

# Token Embedding
token_embd\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat \
    /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
    /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    IQ3_KS \
    24
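
For context, the imatrix file referenced above was presumably computed against the BF16 GGUF. A minimal sketch of that step, assuming the tool works like mainline llama.cpp's llama-imatrix and using a placeholder calibration file name (the actual corpus used here is not shown):

# Sketch: generate an importance matrix from the BF16 model over a calibration text.
# calibration.txt is a hypothetical placeholder file name.
./build/bin/llama-imatrix \
    --model /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
    -f calibration.txt \
    -o imatrix-Hunyuan-A13B-Instruct-BF16.dat \
    --threads 24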

Quick Start

16GB VRAM + 24GB RAM Hybrid GPU+CPU Inference

# Basically, trade off VRAM between longer context and more speed for your configuration.
./build/bin/llama-server \
  --model /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
  --alias ubergarm/Hunyuan-A13B-Instruct-IQ3_KS \
  -fa -fmoe \
  -rtr \
  -ctk q8_0 -ctv q8_0 \
  -c 32768 \
  --temp 0.6 \
  --presence-penalty 0.7 \
  --min-p 0.1 \
  -ngl 99 \
  -ot "blk\.([0-9])\.ffn_.*=CUDA0" \
  -ot exps=CPU \
  --parallel 1 \
  --threads 16 \
  --host 127.0.0.1 \
  --port 8083
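
Once the server is up, a quick sanity check, assuming the fork keeps mainline llama-server's OpenAI-compatible chat endpoint, might look like:

# Sketch: request a short completion from the running server on port 8083.
curl http://127.0.0.1:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ubergarm/Hunyuan-A13B-Instruct-IQ3_KS",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "temperature": 0.6
  }'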

Perplexity

The perplexity on these Hunyuan-A13B-Instruct models seems really high compared to stuff I've seen before. Check out the mainline llama.cpp PR14425 for more details.

  • IQ3_KS 34.088 GiB (3.642 BPW) Final estimate: PPL = 522.7473 +/- 5.68072
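
If you want to reproduce a comparable number, a rough sketch using the bundled llama-perplexity tool is shown below; the test file name is an assumption, as the exact corpus and settings for the figure above aren't listed here:

# Sketch: measure perplexity of the quant over a test corpus (wiki.test.raw is a placeholder).
./build/bin/llama-perplexity \
  --model /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
  -f wiki.test.raw \
  -fa \
  --threads 16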

Speed

Used the built-in llama-sweep-bench tool to measure example speeds across a variety of context-length chats (N_KV is the KV-cache depth used for generation).

[llama-sweep-bench graph]

llama-sweep-bench

# Offload 15 total layers and increase ubatch from default of -ub 512 up to -ub 2048 for big PP!
export model=/mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fa -fmoe \
  -rtr \
  -ctk q8_0 -ctv q8_0 \
  -c 32768 \
  -ngl 99 \
  -ot "blk\.([0-9])\.ffn_.*=CUDA0" \
  -ot "blk\.(1[0-4])\.ffn_.*=CUDA0" \
  -ub 2048 -b 2048 \
  -ot exps=CPU \
  --threads 16 \
  --warmup-batch

NOTE: Building Experimental PRs

This branch is based on currently unreleased PRs, so it is quite experimental. To build it before those PRs are merged, try something like this:

# get the code setup
cd projects
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
git remote add ubergarm https://github.com/ubergarm/ik_llama.cpp
git fetch ubergarm
git checkout ug/hunyuan-moe-2

# build for CUDA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

# clean up later if things get merged into main
git checkout main
git branch -D ug/hunyuan-moe-2

VRAM Estimations

Approximate VRAM use by context length (a rough scaling sketch follows the list):

  • 8k = 3790MiB total with KV self size = 544.00 MiB, K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
  • 32k = 5462MiB total with KV self size = 2176.00 MiB, K (q8_0): 1088.00 MiB, V (q8_0): 1088.00 MiB
  • 64k = 7734MiB total with KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
  • 256k = 21162MiB total with KV self size = 17408.00 MiB, K (q8_0): 8704.00 MiB, V (q8_0): 8704.00 MiB
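
The KV-cache portion scales linearly with context, working out to about 68 KiB per token for q8_0 K+V given the figures above (8k tokens -> 544 MiB), so you can roughly estimate other context lengths; a back-of-the-envelope sketch:

# Sketch: estimate the q8_0 KV-cache size for an arbitrary context length
# using the ~68 KiB/token rate implied by the list above.
ctx=49152   # example: 48k context
echo "KV cache ~ $(( ctx * 68 / 1024 )) MiB, plus the roughly 3-4 GiB of fixed overhead seen above"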

ROPE Considerations

The rope-freq-base defaults to about 11 million (11158840) but can be adjusted down, possibly to better match shorter-context applications.

# adjust to 3 million
--rope-freq-base 3000000

Thanks to @kooshi for this tip; feel free to experiment with it.
