ik_llama.cpp imatrix Quantizations of Hunyuan-A13B-Instruct
This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCpp.
These quants provide best-in-class perplexity for the given memory footprint.
Big Thanks
Shout out to Wendell and the Level1Techs crew, the community forums, and the YouTube channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!
Quants
IQ3_KS
34.088 GiB (3.642 BPW)
Special mix: IQ4_KS ffn_down and all-new IQ3_KS ffn_(up|gate) routed experts, with iq6_k/iq5_k for attn and the shared expert as shown in the recipe below. Test out -rtr to run-time-repack tensors to _r4 variants, which is likely faster for the layers running on CPU/RAM at default ubatch sizes.
With under 16GB VRAM and ~24GB RAM you can fit 32k context and still offload 10 extra exps layers onto the GPU for extra TG speed! It can even run on just 4GB VRAM with lower context and no extra offload layers, given enough system RAM (~32GiB). With extra VRAM, run more context or offload additional layers.
👈 Secret Recipe
custom="
# Attention
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_o.*=iq5_k
# 1x Shared Expert
blk\..*\.ffn_(down)_shexp.*=iq6_k
blk\..*\.ffn_(gate|up)_shexp.*=iq5_k
# 64x Routed Experts
blk\..*\.ffn_(down)_exps.*=iq4_ks
blk\..*\.ffn_(gate|up)_exps.*=iq3_ks
# Token Embedding
token_embd\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
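# After the grep/sed above, $custom has the comment lines stripped and the
# remaining regex=type pairs joined into one comma-separated string, e.g. it
# begins with: blk\..*\.attn_k.*=iq6_k,blk\..*\.attn_v.*=iq6_k,...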
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat \
/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
IQ3_KS \
24
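For reference, the imatrix passed via --imatrix above was presumably computed against the BF16 GGUF over a calibration corpus. Below is a minimal sketch of how such a .dat file can be produced, assuming the fork ships a llama-imatrix binary alongside the other tools used here; the calibration file name, context size, and thread count are placeholders, not the exact settings used.
./build/bin/llama-imatrix \
    --model /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
    -f calibration-corpus.txt \
    -o imatrix-Hunyuan-A13B-Instruct-BF16.dat \
    --ctx-size 512 \
    --threads 24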
Quick Start
16GB VRAM + 24GB RAM Hybrid GPU+CPU Inference
# Basically, trade off VRAM between longer context and more speed for your configuration.
./build/bin/llama-server \
--model /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
--alias ubergarm/Hunyuan-A13B-Instruct-IQ3_KS \
-fa -fmoe \
-rtr \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
--temp 0.6 \
--presence-penalty 0.7 \
--min-p 0.1 \
-ngl 99 \
-ot "blk\.([0-9])\.ffn_.*=CUDA0" \
-ot exps=CPU \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8083
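Once the server is up, a quick sanity check against its OpenAI-compatible chat endpoint looks roughly like this (a sketch; the route and payload assume the standard llama-server API, so adjust the host/port if you changed them above):
curl -s http://127.0.0.1:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ubergarm/Hunyuan-A13B-Instruct-IQ3_KS", "messages": [{"role": "user", "content": "Say hello in one short sentence."}], "max_tokens": 64}'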
Perplexity
The perplexity on these Hunyuan-A13B-Instruct models seems really high compared to stuff I've seen before. Check out the mainline llama.cpp PR14425 for more details.
IQ3_KS 34.088 GiB (3.642 BPW)
Final estimate: PPL = 522.7473 +/- 5.68072
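For anyone who wants to reproduce or compare numbers, here is a rough sketch of a perplexity run with the fork's llama-perplexity tool; the wiki.test.raw path is a placeholder and the offload flags simply mirror the Quick Start command, so this is not necessarily the exact invocation behind the figure above.
./build/bin/llama-perplexity \
    --model /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    -f wiki.test.raw \
    -fa -fmoe -rtr \
    -ngl 99 \
    -ot "blk\.([0-9])\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    --threads 16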
Speed
Used the built-in llama-sweep-bench tool for example speeds across a variety of context-length chats (N_KV is the kv-cache depth used for generation).
llama-sweep-bench
# Offload 15 total layers and increase ubatch from default of -ub 512 up to -ub 2048 for big PP!
export model=/mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-fa -fmoe \
-rtr \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
-ngl 99 \
-ot "blk\.([0-9])\.ffn_.*=CUDA0" \
-ot "blk\.(1[0-4])\.ffn_.*=CUDA0" \
-ub 2048 -b 2048 \
-ot exps=CPU \
--threads 16 \
--warmup-batch
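To see how the ubatch size affects prompt processing on your own hardware, one option is to sweep a few values with otherwise identical flags. This is just a sketch reusing the command above with a shorter context so each run stays quick:
for ub in 512 1024 2048; do
  ./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa -fmoe -rtr \
    -ctk q8_0 -ctv q8_0 \
    -c 8192 \
    -ngl 99 \
    -ot "blk\.([0-9])\.ffn_.*=CUDA0" \
    -ot "blk\.(1[0-4])\.ffn_.*=CUDA0" \
    -ub "$ub" -b 2048 \
    -ot exps=CPU \
    --threads 16 \
    --warmup-batch
done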
NOTE: Building Experimental PRs
This branch is based on currently unreleased PRs, so it is quite experimental. To build it before the PRs are merged, try something like this:
# get the code setup
cd projects
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
git remote add ubergarm https://github.com/ubergarm/ik_llama.cpp
git fetch ubergarm
git checkout ug/hunyuan-moe-2
# build for CUDA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
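# optional sanity check after the build (a sketch; binary names assume the
# llama-* pattern used elsewhere in this card)
git status -sb           # confirm the ug/hunyuan-moe-2 branch is checked out
ls build/bin/llama-*     # e.g. llama-server, llama-quantize, llama-sweep-bench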
# clean up later if things get merged into main
git checkout main
git branch -D ug/hunyuan-moe-2
VRAM Estimations
Context length = VRAM use:
- 8k = 3790MiB total with KV self size = 544.00 MiB, K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
- 32k = 5462MiB total with KV self size = 2176.00 MiB, K (q8_0): 1088.00 MiB, V (q8_0): 1088.00 MiB
- 64k = 7734MiB total with KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
- 256k = 21162MiB total with KV self size = 17408.00 MiB, K (q8_0): 8704.00 MiB, V (q8_0): 8704.00 MiB
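The KV-cache portion scales linearly with context length, so you can interpolate other sizes from the rows above. A quick consistency check with plain shell arithmetic, using the ~544 MiB KV self size measured at 8k:
# KV self size at q8_0 is ~544 MiB per 8192 tokens of context
echo $(( 65536 / 8192 * 544 ))   # prints 4352, matching the 64k KV self size in MiB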
RoPE Considerations
The rope-freq-base defaults to about 11 million (11158840) but can be adjusted down to possibly better match shorter-context applications.
# adjust to 3 million
--rope-freq-base 3000000
Thanks to @kooshi for this tip; feel free to experiment with it.