Q6_K
Hey, by any chance could you also create a Q6 K file with this format please?
I'm thinking about making two more possibly, e.g. one that is a bit heavier and one that is a bit lighter. I've been testing the -mix-IQ3_K
and it seems really good running locally, on par with DeepSeek-V3-0324
in my anecdotal opinion, but of course it is a reasoning model so it takes a bit longer.
The current -mix-IQ3_K
also barely fits on my rig, so I have to close my Firefox browser to free up enough RAM, and I'm running a super lean Arch Linux + X11 + dwm tiling window manager + alacritty terminal setup. So having a leaner version could be handy, as most folks will likely have to run headless or have a little swap space set up to hold their browser RAM haha...
Any specific VRAM+RAM breakpoints you're working with regarding a possible IQ6_K
version? I'd probably go full Q8_0
for all attention layers as they are pretty small, then do IQ6_K/IQ5_K for gate/(up|down) or similar...
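For reference, the kind of recipe I have in mind would look roughly like this with ik_llama.cpp's llama-quantize. This is just an untested sketch from memory: the --custom-q regexes, the imatrix file, and the paths are placeholders to show the idea, not a final recipe:

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix-Qwen3-235B-A22B.dat \
    --custom-q "blk\..*\.attn_.*=q8_0,blk\..*\.ffn_down_exps.*=iq6_k,blk\..*\.ffn_(gate|up)_exps.*=iq5_k" \
    /path/to/Qwen3-235B-A22B-BF16.gguf \
    /path/to/Qwen3-235B-A22B-mix-IQ6_K.gguf \
    IQ6_K \
    24

(If I remember right, the trailing type just acts as the fallback for any tensors the custom regexes don't match.)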
Is it worth moving to this from IQ4_XS? Those extra flags like -fmoe and -rtr have gained me 87.79 t/s PP and 9.97 t/s TG. In regular llama.cpp I only get half of that, and I was going to grab a smaller unsloth quant but they keep updating it and breaking my downloads.
How fast do your small DeepSeek quants run compared to Qwen? I know jack about offloading compared to just doing GPU inference, so much to learn.
Is it worth moving to this from IQ4_XS?
If you can fit this -mix-IQ3_K
in your rig, I can definitely recommend this over the unsloth UD-Q3_K_XL
which is of similar size class. I don't have numbers on the unsloth IQ4_XS
, but am working on more benchmarks now and hope to do a post on r/LocalLLaMA
soon :tm:.
Here is a sneak peek of what I have already:
Interestingly, bartowski's Qwen3-30B-A3B
~4bpw quants are looking very competitive, hope to work more on that model soon :tm: too!
How fast do your small DeepSeek quants run compared to Qwen?
In my limited testing on my local rig, I'd choose my Qwen3-235B-A22B-mix-IQ3_K
every time now over my DeepSeek-V3-0324-IQ2_K_R4
or unreleased DeepSeek-R1-GGUF-Q2_K_R4
. Qwen3 is much faster, and in my limited testing the quality also feels better than V3-0324 and is on par with or possibly better than the smaller R1
quants, at least for coding-type tasks.
Here is the speed graph running my Qwen3-235B-A22B-mix-IQ3_K
locally on a 3090TI FE 24GB VRAM + AMD 9950X 2x48GB DDR5-6400 rig.
Also, ik may be working on more improvements to the GQA FA implementation on his fork, which could possibly improve speed even more for Qwen3 and similar models.
Seeing that KLD, I'm glad I didn't waste time with the other quant. DeepSeek is better for creative tasks, unfortunately. I saw surprisingly decent speeds in the ik_llama discussions; my assumption would have been 2-3 t/s at best without fancy new-generation Xeons.
The IQ4 gives me similar outputs to the API; if this does too and generates faster, it would be a win.
Sorry to chime in here, but any possibility of a Q4_K? I could fit it into my PC using ~20GB of RAM.
Now I'm downloading this one and it should fit fully in VRAM in my case. Would there be any issues using only CUDA?
EDIT: Tested on full CUDA and working fine! Pretty nice results while testing the model.
@Lockout keep us posted, I'd love to hear if this meets your quality expectations! I've been impressed with it so far.
@Panchovix
oh hey you have all the GPUs yes! Correct, I thought of you while making this model and did not repack the quants to _R4
myself to allow a wider variety of VRAM+RAM combinations to work out of the box. If someone wants to run on RAM they can use -rtr
or the offline repack tool themselves easily enough without downloading anything more.
If anyone is interested, I have some limited benchmarks for speed and quality on my fresh new Qwen3-30B-A3B-mix-IQ4_K.gguf and hit over 1600 tok/sec PP and 105 tok/sec TG peak on my 3090 TI FE 24GB VRAM!
Heh, it's almost done downloading even, an hour left. I did not find any benefit with ik_llama for full GPU inference. It was slower than mainline. Maybe it's different if you pass -fmoe and some of the other flags, but dense models lost t/s. Come to think of it.. I can no-shit fully offload this quant too. If I recruit my 5th GPU I will have 118GB. Also could install 1 or 2 24GB P40s or a P100.. power consumption not worth it though.
Also getting curious about THP and if that will help. It says to run without mmap, so -rtr is fine.. but do I then turn rtr off to "benefit"? From the issue it says to clear caches when switching so the weights will load from HDD once again. I have dual socket with 1 NUMA node per socket. Maybe Q3/Q4 of Qwen is too small to need any of that?
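For anyone else poking at this, the cache clearing and THP check are just the standard Linux knobs, e.g. something like:

# drop the page cache so the next load reads weights from disk again (needs root)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# check whether transparent huge pages are enabled ([always] / [madvise] / [never])
cat /sys/kernel/mm/transparent_hugepage/enabled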
On 128GB VRAM (5090 + 4090x2 + A6000) but slow PCIe, I get lower speeds vs main, yes, but it's still pretty fast.
At X8/X8/X4/X4, I get 500 t/s PP and 23 t/s while generating (IQ3, ik_llama.cpp).
On main I get the same PP but 28 t/s while generating (UD Q3_K_XL).
On UD Q4_K_XL with CPU offloading (20GB RAM or so) I get 300 t/s PP and 20 t/s while generating on main llama.cpp.
With this quant I see some 11.x output token speeds, so it's slightly faster.
Run it like this with 32K context:
--numa distribute \
-ngl 94 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-rtr \
-fmoe \
-ub 1024 \
-amb 1024 \
-ot "(1[0-9]).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU" \
The bigger ubatch increases PP speed at the cost of TG t/s and makes bigger buffers on CUDA. I don't know if -amb 1024 vs 512 makes a difference.
PP is now over 100.
ok.. update as to quality....
So here I am torn: the model is more likely to not know what mesugaki means in IQ3, but reloading the IQ4 has caused it to screw up often as well. Had this strange issue where I got better outputs when I offloaded more to CPU, and repetition got to be less as well. I was testing CPU-only inference with CUDA PP. Now that I run it more, the IQ4 and IQ3 are both printing fairly similar t/s too.
Thanks for all the testing and results all!
@Lockout
I just saw that
@ArtusDev
released what looks like some kind of IQ6_K
version here: https://huggingface.co/ArtusDev/Qwen3-235B-A22B-GGUF (but I have not tested it myself). Looking at the model card sidebar it suggests they are using Q8_0
for ffn_down_exps
which is a bit surprising to me for an IQ6_K
. I don't see the exact recipe they used however, and unfortunately huggingface doesn't recognize iqN_k
quants properly in the gguf dump side-bar... Theirs is almost double the size of this one, maybe enough bits to "know what mesugaki means" ? haha
TIL: I just looked it up myself....
I was testing some offload commands way too late into the morning with some folks on a Discord called "beaver ai club", possibly run by "thedrummer" (I think lol). Anyway, 59smoke and Aes Sedai and I worked out a better starting place for multi-GPU tensor-override for this model, e.g.
./build/bin/llama-server \
--model /mnt/models/ubergarm/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
--alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
-fa \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
-fmoe \
-amb 512 \
-rtr \
-ngl 99 \
-ts 24,24 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn.*=CUDA0" \
-ot "blk\.(14|15|16|17|18|19|20|21|22|23|24|25|26)\.ffn.*=CUDA1" \
-ot "ffn.*=CPU" \
--threads 8 \
--host 127.0.0.1 \
--port 8080
We saw a lot of speed-ups by removing that *_exps
part and matching all ffn
layers; otherwise some of the ffn tensors would be on GPU and some on CPU, which seemed to hurt performance by about 50%.
Seems like a 24GB GPU can hold maybe ~12-14ish layers or so depending on context and kv cache quantization etc...
EDIT: Also for multi GPU setups look into compiling with -DGGML_SCHED_MAX_COPIES=1
(default is 4) which may free up some VRAM etc.
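e.g. a quick build sketch, assuming the usual cmake flow (adjust backend flags for your setup):

cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)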
I gotta update the model card once this gets ironed out a bit better. Thanks!
Oh interesting, gonna try! By the way I wonder, would normal llamacpp quants work here as well on ikllamacpp? Like unsloth UD Q6_K.
I noticed a weird behaviour with offloading experts to CUDA as you have noticed on these models. Like setting a range for some reason doesn't work as expected. It didn't occur to me to use each layer specifically, so gonna try with that instead. E.g. I set 1 (one!) expert to a 5090 and it uses 26GB VRAM on Q4_K_XL.
Oh and sorry to bump here, but any chance to add nemotron 253b support on ikllamacpp? From this PR https://github.com/ggml-org/llama.cpp/pull/12843
Oh interesting, gonna try! By the way I wonder, would normal llamacpp quants work here as well on ikllamacpp? Like unsloth UD Q6_K.
Yes, you can run all normal mainline quantizations with ik_llama.cpp
and take advantage of the -rtr
and other optimizations as well without specifically needing the iqN_k
quants. Check out these sweet benchmarks and commands by
@AesSedai
comparing some runs with that model:
https://github.com/ikawrakow/ik_llama.cpp/discussions/357#discussioncomment-13020187
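So e.g. pointing ik_llama.cpp's llama-server at a mainline GGUF and just adding the extra flags should work. Rough sketch only, the path is a placeholder and you'd tune -ngl/-ot/threads for your rig:

./build/bin/llama-server \
    --model /path/to/your/mainline-Q6_K.gguf \
    -fa -fmoe -rtr -amb 512 \
    -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    -ngl 99 \
    -ot "ffn.*=CPU" \
    --threads 8 --host 127.0.0.1 --port 8080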
I noticed a weird behaviour with offloading experts to CUDA as you have noticed on these models. Like setting a range for some reason doesn't work as expected. It didn't occur to me to use each layer specifically, so gonna try with that instead. E.g. I set 1 (one!) expert to a 5090 and it uses 26GB VRAM on Q4_K_XL.
Right, I haven't figured out exactly what is going on, but something happens when mixing -ts
and -ot
on multi-GPU setups that makes it hard to predict how things will get allocated, often leading to uneven distribution, OOMs, and such. So we started being more explicit with -ot
for more layers, and it seemed to run more balanced across the GPUs and show performance more in line with expectations.
I don't have 2x GPUs on my home rig to easily try and my remote rig is busy with some benchmarks for now.
Oh and sorry to bump here, but any chance to add nemotron 253b support on ikllamacpp? From this PR https://github.com/ggml-org/llama.cpp/pull/12843
You're a demanding customer xD loljk. I'd suggest opening an issue on ik_llama.cpp and looking at recent PRs like the one I did for GLM-4 to see how to adapt model loading, CUDA graph building, and other misc bits for adding a new architecture. Given nemotron 253b looks like a dense model, personally I'd hold off just a little bit until ik finishes the attention rework going on here now: https://github.com/ikawrakow/ik_llama.cpp/pull/370
After that settles down a bit, it might be a good time to add more archs; just my two cents.
Cheers!
Haha many thanks for all the info!
Okay, just wanted to update: the way to load the experts with each number is amazing! Could load Q6_K_M on normal llamacpp with good usage with
./llama-server -m '/home/llm/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf' -c 32768 --no-mmap --no-warmup -v -ngl 999 -fa -ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn.*=CUDA0" -ot "blk\.(9|10|11|12|13|14|15|16|17)\.ffn.*=CUDA1" -ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30)\.ffn.*=CUDA2" -ot "blk\.(31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52)\.ffn.*=CUDA3" -ot "ffn.*=CPU"
Which is about 21GB on each 4090, 30GB on the 5090 and 44GB on the A6000.
And got these speeds
prompt eval time = 57152.69 ms / 3877 tokens ( 14.74 ms per token, 67.84 tokens per second)
eval time = 38705.90 ms / 318 tokens ( 121.72 ms per token, 8.22 tokens per second)
Not the best, not the worst, but pretty usable for coding! Now I have to test ikllamacpp to see if I get better speeds when offloading, but man separating the experts by each number is magic.
Okay ran ikllamacpp with similar command but added -fmoe, -amb 512 and -rtr, so it looks like this
./llama-server -m '/home/llm/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf' -c 16384 --no-mmap --no-warmup -v -ngl 999 -ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn.*=CUDA0" -ot "blk\.(9|10|11|12|13|14|15|16|17)\.ffn.*=CUDA1" -ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30)\.ffn.*=CUDA2" -ot "blk\.(31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52)\.ffn.*=CUDA3" -ot "ffn.*=CPU" -fmoe -amb 512 -rtr
And got a huge jump in PP performance
INFO [ print_timings] prompt eval time = 39663.05 ms / 3877 tokens ( 10.23 ms per token, 97.75 tokens per second) | tid="140196332539904" timestamp=1746306311 id_slot=0 id_task=0 t_prompt_processing=39663.052 n_prompt_tokens_processed=3877 t_token=10.230346143925717 n_tokens_second=97.74840322424002
INFO [ print_timings] generation eval time = 102110.03 ms / 825 runs ( 123.77 ms per token, 8.08 tokens per second) | tid="140196332539904" timestamp=1746306311 id_slot=0 id_task=0 t_token_generation=102110.027 n_decoded=825 t_token=123.7697296969697 n_tokens_second=8.079519947634525
Basically 47-50% faster!
EDIT: Try 2
INFO [ print_timings] prompt eval time = 36897.66 ms / 3877 tokens ( 9.52 ms per token, 105.07 tokens per second) | tid="140095757803520" timestamp=1746307138 id_slot=0 id_task=0 t_prompt_processing=36897.659 n_prompt_tokens_processed=3877 t_token=9.517064482847562 n_tokens_second=105.07441678075024
INFO [ print_timings] generation eval time = 143560.31 ms / 1197 runs ( 119.93 ms per token, 8.34 tokens per second) | tid="140095757803520" timestamp=1746307138 id_slot=0 id_task=0 t_token_generation=143560.31 n_decoded=1197 t_token=119.93342522974102 n_tokens_second=8.337959147622348
Pretty nice.
I've been using llama-sweep-bench and found GGML_SCHED_MAX_COPIES=1 doesn't really help. You get higher t/s but slower PP. Many other params have this tradeoff. Better results with AMB of 512 than 1024. -ub 1024 worked best.
Will try to do just complete FFN layers. Way to make me re-test again.
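For reference, the sweep-bench runs below are just llama-sweep-bench with the same model and flags as the server command, roughly like this sketch (plus the -ngl and -ot overrides shown with each result):

./build/bin/llama-sweep-bench \
    -m /path/to/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
    -c 32768 -ub 1024 -fa \
    -ctk q8_0 -ctv q8_0 \
    -fmoe -amb 512 -rtr \
    --threads 28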
iq3 with MAX_COPIES=1
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 95, n_threads = 28, n_threads_batch = 28
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 11.519 | 88.90 | 17.821 | 14.37 |
1024 | 256 | 1024 | 10.921 | 93.76 | 17.361 | 14.75 |
1024 | 256 | 2048 | 11.279 | 90.79 | 18.501 | 13.84 |
1024 | 256 | 3072 | 11.268 | 90.87 | 18.989 | 13.48 |
1024 | 256 | 4096 | 11.124 | 92.05 | 19.846 | 12.90 |
1024 | 256 | 5120 | 11.005 | 93.05 | 20.465 | 12.51 |
1024 | 256 | 6144 | 11.266 | 90.89 | 21.227 | 12.06 |
1024 | 256 | 7168 | 11.235 | 91.14 | 22.048 | 11.61 |
1024 | 256 | 8192 | 11.551 | 88.65 | 22.873 | 11.19 |
1024 | 256 | 9216 | 11.534 | 88.78 | 23.690 | 10.81 |
1024 | 256 | 10240 | 11.545 | 88.70 | 24.413 | 10.49 |
1024 | 256 | 11264 | 11.522 | 88.87 | 25.077 | 10.21 |
1024 | 256 | 12288 | 11.607 | 88.23 | 26.143 | 9.79 |
1024 | 256 | 13312 | 11.703 | 87.50 | 26.454 | 9.68 |
1024 | 256 | 14336 | 11.880 | 86.20 | 27.781 | 9.22 |
1024 | 256 | 15360 | 11.804 | 86.75 | 27.923 | 9.17 |
1024 | 256 | 16384 | 11.926 | 85.86 | 29.411 | 8.70 |
vs
-ot "\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU"
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 9.406 | 108.87 | 19.733 | 12.97 |
1024 | 256 | 1024 | 9.044 | 113.23 | 19.282 | 13.28 |
1024 | 256 | 2048 | 9.103 | 112.49 | 19.568 | 13.08 |
1024 | 256 | 3072 | 9.230 | 110.94 | 20.020 | 12.79 |
1024 | 256 | 4096 | 9.330 | 109.75 | 21.192 | 12.08 |
1024 | 256 | 5120 | 9.285 | 110.28 | 21.932 | 11.67 |
1024 | 256 | 6144 | 9.331 | 109.75 | 22.542 | 11.36 |
1024 | 256 | 7168 | 9.635 | 106.28 | 23.735 | 10.79 |
1024 | 256 | 8192 | 9.540 | 107.33 | 24.221 | 10.57 |
1024 | 256 | 9216 | 9.896 | 103.48 | 25.540 | 10.02 |
1024 | 256 | 10240 | 9.931 | 103.12 | 25.744 | 9.94 |
1024 | 256 | 11264 | 9.852 | 103.94 | 27.056 | 9.46 |
1024 | 256 | 12288 | 9.959 | 102.82 | 27.363 | 9.36 |
1024 | 256 | 13312 | 9.900 | 103.43 | 28.057 | 9.12 |
1024 | 256 | 14336 | 10.082 | 101.57 | 28.988 | 8.83 |
1024 | 256 | 15360 | 10.252 | 99.88 | 29.665 | 8.63 |
1024 | 256 | 16384 | 10.381 | 98.64 | 30.715 | 8.33 |
1024 | 256 | 17408 | 10.377 | 98.68 | 31.747 | 8.06 |
1024 | 256 | 18432 | 10.496 | 97.56 | 32.407 | 7.90 |
1024 | 256 | 19456 | 10.405 | 98.42 | 33.066 | 7.74 |
1024 | 256 | 20480 | 10.678 | 95.90 | 34.071 | 7.51 |
1024 | 256 | 21504 | 10.622 | 96.40 | 34.884 | 7.34 |
1024 | 256 | 22528 | 10.793 | 94.88 | 35.753 | 7.16 |
1024 | 256 | 23552 | 10.855 | 94.34 | 36.423 | 7.03 |
1024 | 256 | 24576 | 11.138 | 91.94 | 37.135 | 6.89 |
1024 | 256 | 25600 | 11.020 | 92.92 | 37.695 | 6.79 |
1024 | 256 | 26624 | 11.241 | 91.09 | 38.460 | 6.66 |
1024 | 256 | 27648 | 11.156 | 91.79 | 39.634 | 6.46 |
1024 | 256 | 28672 | 11.297 | 90.64 | 40.637 | 6.30 |
1024 | 256 | 29696 | 11.609 | 88.21 | 41.458 | 6.17 |
1024 | 256 | 30720 | 11.420 | 89.66 | 41.816 | 6.12 |
1024 | 256 | 31744 | 11.560 | 88.58 | 42.828 | 5.98 |
-ot "(1[0-9]|39).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU"
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 9.455 | 108.30 | 20.046 | 12.77 |
1024 | 256 | 1024 | 9.044 | 113.23 | 19.252 | 13.30 |
1024 | 256 | 2048 | 9.134 | 112.11 | 19.727 | 12.98 |
1024 | 256 | 3072 | 9.173 | 111.63 | 20.501 | 12.49 |
1024 | 256 | 4096 | 9.157 | 111.82 | 21.064 | 12.15 |
1024 | 256 | 5120 | 9.322 | 109.85 | 22.093 | 11.59 |
1024 | 256 | 6144 | 9.289 | 110.24 | 22.626 | 11.31 |
1024 | 256 | 7168 | 9.510 | 107.67 | 23.796 | 10.76 |
1024 | 256 | 8192 | 9.641 | 106.21 | 24.726 | 10.35 |
1024 | 256 | 9216 | 9.674 | 105.85 | 25.821 | 9.91 |
1024 | 256 | 10240 | 9.857 | 103.88 | 26.529 | 9.65 |
1024 | 256 | 11264 | 9.906 | 103.37 | 27.412 | 9.34 |
1024 | 256 | 12288 | 10.087 | 101.52 | 28.002 | 9.14 |
1024 | 256 | 13312 | 9.963 | 102.78 | 28.809 | 8.89 |
1024 | 256 | 14336 | 10.214 | 100.25 | 29.980 | 8.54 |
1024 | 256 | 15360 | 10.263 | 99.78 | 30.997 | 8.26 |
1024 | 256 | 16384 | 10.286 | 99.56 | 31.577 | 8.11 |
1024 | 256 | 17408 | 10.511 | 97.42 | 32.338 | 7.92 |
1024 | 256 | 18432 | 10.451 | 97.98 | 32.650 | 7.84 |
1024 | 256 | 19456 | 10.491 | 97.61 | 33.754 | 7.58 |
1024 | 256 | 20480 | 10.703 | 95.67 | 33.956 | 7.54 |
1024 | 256 | 21504 | 10.707 | 95.64 | 34.782 | 7.36 |
1024 | 256 | 22528 | 10.773 | 95.05 | 35.988 | 7.11 |
1024 | 256 | 23552 | 10.946 | 93.55 | 36.824 | 6.95 |
1024 | 256 | 24576 | 11.020 | 92.92 | 37.100 | 6.90 |
1024 | 256 | 25600 | 10.987 | 93.20 | 38.272 | 6.69 |
1024 | 256 | 26624 | 11.166 | 91.71 | 39.116 | 6.54 |
1024 | 256 | 27648 | 11.420 | 89.67 | 40.111 | 6.38 |
1024 | 256 | 28672 | 11.370 | 90.06 | 41.202 | 6.21 |
1024 | 256 | 29696 | 11.510 | 88.97 | 41.707 | 6.14 |
1024 | 256 | 30720 | 11.573 | 88.48 | 42.415 | 6.04 |
1024 | 256 | 31744 | 11.530 | 88.82 | 42.722 | 5.99 |
Also.. for some reason I have to set -ngl lower.. like 93/94 instead of 95. Otherwise it doesn't fill the GPUs but tries to allocate massive buffers while taking far fewer layers. It asked for 9GB+ even, and it's not KV or the compute buffer.
I got it to load at 95 and perf is much worse
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 21.511 | 47.60 | 31.077 | 8.24 |
1024 | 256 | 1024 | 21.014 | 48.73 | 30.145 | 8.49 |
vs
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 10.069 | 101.70 | 20.471 | 12.51 |
1024 | 256 | 1024 | 9.933 | 103.09 | 19.225 | 13.32 |
Offloading sequential layers makes things more consistent but not necessarily faster.
@Lockout
the reason for GGML_SCHED_MAX_COPIES
is that ik_llama.cpp will try to duplicate the VRAM allocation (for pipeline parallelism) if there is more than one GPU involved:
# ref: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L20201-L20216
// enabling pipeline parallelism in the scheduler increases memory usage, so it is only done when necessary
bool pipeline_parallel =
llama_get_device_count(*model) > 1 &&
model->n_gpu_layers > (int)model->hparams.n_layer &&
model->split_mode == LLAMA_SPLIT_MODE_LAYER &&
params.offload_kqv;
#ifndef GGML_USE_CUDA
// pipeline parallelism requires support for async compute and events
// currently this is only implemented in the CUDA backend
pipeline_parallel = false;
#endif
ctx->sched = ggml_backend_sched_new(ctx->backends.data(), backend_buft.data(), ctx->backends.size(), max_nodes, pipeline_parallel);
if (pipeline_parallel) {
LLAMA_LOG_INFO("%s: pipeline parallelism enabled (n_copies=%d)\n", __func__, ggml_backend_sched_get_n_copies(ctx->sched));
}
and at least for me, with the 235B-A22B, I'm already loading up the entirety of my VRAM so I don't have 3x as much VRAM to spare for the parallelism. It's less a speed thing and more of a "I want the model to load at all, please" thing.
@AesSedai is correct. I was trying to use the default, but then it tries to copy some buffers and I get OOM. Especially a big 10GB buffer on the A6000.
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors: CPU buffer size = 77572.64 MiB
llm_load_tensors: CUDA_Host buffer size = 486.86 MiB
llm_load_tensors: CUDA0 buffer size = 18032.50 MiB
llm_load_tensors: CUDA1 buffer size = 18032.50 MiB
llm_load_tensors: CUDA2 buffer size = 25879.55 MiB
llm_load_tensors: CUDA3 buffer size = 44064.14 MiB
...
llama_kv_cache_init: CUDA0 KV buffer size = 1152.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1152.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 1472.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 2240.00 MiB
llama_new_context_with_model: KV self size = 6016.00 MiB, K (f16): 3008.00 MiB, V (f16): 3008.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 2386.63 MiB
ggml_gallocr_reserve_n: reallocating CUDA1 buffer from size 0.00 MiB to 1056.51 MiB
ggml_gallocr_reserve_n: reallocating CUDA2 buffer from size 0.00 MiB to 2732.50 MiB
ggml_gallocr_reserve_n: reallocating CUDA3 buffer from size 0.00 MiB to 10432.71 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 10432.71 MiB on device 3: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA3 buffer of size 10939492352
llama_new_context_with_model: failed to allocate compute buffers
So that's the mystery. Turning off the copies must be what -ngl 94 is accomplishing: with only 94 layers offloaded, the n_gpu_layers > n_layer check in the snippet above is no longer true, so pipeline parallelism (and its extra buffer copies) never kicks in.
llm_load_tensors: offloaded 94/95 layers to GPU
llm_load_tensors: CPU buffer size = 19656.00 MiB
llm_load_tensors: CUDA_Host buffer size = 1261.20 MiB
llm_load_tensors: CUDA0 buffer size = 22148.27 MiB
llm_load_tensors: CUDA1 buffer size = 22089.93 MiB
llm_load_tensors: CUDA2 buffer size = 22148.27 MiB
llm_load_tensors: CUDA3 buffer size = 22089.93 MiB
....................................................................................................
============ Repacked 55 tensors
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 1024
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 782.01 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 782.01 MiB
llama_new_context_with_model: KV self size = 3196.00 MiB, K (q8_0): 1598.00 MiB, V (q8_0): 1598.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 416.00 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 273.50 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 273.50 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 273.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 609.50 MiB
btw.. holy crap.. pull the new FA fixes.
This is such a beautiful discussion lmao, <3 y'all! I'll send folks over here as they embark on their multi-GPU tensor offload journey! haha
Speaking of discussions.. has anyone tried https://huggingface.co/MikeRoz/Qwen3-235B-A22B-exl2 and how does it compare?
new FA IQ3_K
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 9.722 | 105.33 | 20.006 | 12.80 |
1024 | 256 | 1024 | 9.149 | 111.92 | 19.087 | 13.41 |
1024 | 256 | 2048 | 9.280 | 110.34 | 18.442 | 13.88 |
1024 | 256 | 3072 | 9.148 | 111.94 | 18.475 | 13.86 |
----snip | ||||||
1024 | 256 | 28672 | 10.278 | 99.63 | 24.305 | 10.53 |
1024 | 256 | 29696 | 10.497 | 97.55 | 24.513 | 10.44 |
1024 | 256 | 30720 | 10.362 | 98.83 | 24.780 | 10.33 |
1024 | 256 | 31744 | 10.314 | 99.28 | 25.245 | 10.14 |
new FA IQ4_XS
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 10.544 | 97.12 | 22.129 | 11.57 |
1024 | 256 | 1024 | 9.993 | 102.48 | 19.998 | 12.80 |
1024 | 256 | 2048 | 10.041 | 101.98 | 19.028 | 13.45 |
1024 | 256 | 3072 | 9.788 | 104.62 | 18.866 | 13.57 |
----snip | ||||||
1024 | 256 | 28672 | 11.045 | 92.71 | 25.269 | 10.13 |
1024 | 256 | 29696 | 11.060 | 92.58 | 25.101 | 10.20 |
1024 | 256 | 30720 | 11.059 | 92.60 | 25.289 | 10.12 |
1024 | 256 | 31744 | 11.196 | 91.46 | 26.192 | 9.77 |
How do you test KLD/perplexity painlessly? Q6_K is probably too big of a speed drop, but these are almost identical. Mainline llama.cpp doesn't even have the sweep bench to compare speeds.
@Lockout @ubergarm has most of the sweep-bench implementation for llama.cpp on this branch: https://github.com/ggml-org/llama.cpp/compare/master...ubergarm:llama.cpp:ug/port-sweep-bench
You can pull / cherrypick those and recompile llama.cpp to get sweep-bench there too.
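E.g. a rough sketch of one way to do it (merging the branch rather than cherry-picking individual commits):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git remote add ubergarm https://github.com/ubergarm/llama.cpp.git
git fetch ubergarm
git merge ubergarm/ug/port-sweep-bench
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)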
How do you test KLD/perplexity painlessly?
I have slowly collected some PPL and KLD numbers here: https://github.com/ikawrakow/ik_llama.cpp/discussions/359#discussioncomment-13009539
Perplexity is as easy as inferencing with the model. But KLD is tricky, as it makes a big file and you ideally want to get the baseline using the full BF16, which may not be easy as it is over 400GB.
Example perplexity run for full offload:
./build/bin/llama-perplexity \
-m "$model" \
--ctx-size 512 \
--ubatch-size 512 \
-f wiki.test.raw \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
Example KLD first pass to generate the KLD base data file from bf16 (or q8_0 if that is the biggest you can fit). Using a smaller fully-offloaded example here; adjust with your exact arguments for a given bigger model etc:
model=/mnt/raid/models/Qwen/Qwen3-30B-A3B/Qwen3-30B-A3B-BF16-00001-of-00002.gguf
./build/bin/llama-perplexity \
-m "$model" \
--kl-divergence-base /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-BF16-ubergarm-kld-test-corpus-base.dat \
-f ubergarm-kld-test-corpus.txt \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
Example KLD second pass using data file from above to test KLD of smaller models vs baseline.
model=/mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf
./build/bin/llama-perplexity \
-m "$model" \
--kl-divergence-base /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-BF16-ubergarm-kld-test-corpus-base.dat \
--kl-divergence \
-f ubergarm-kld-test-corpus.txt \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
The KLD run will also give you PPL for another data point with a different corpus.
Just made a Reddit post with some metrics if you guys are interested: https://www.reddit.com/r/LocalLLaMA/comments/1kezq68/speed_metrics_running_deepseekv3_0324qwen3_235b/
There I posted the offloading speeds!
Ok.. got perplexity working...
235b ubergarm iq3 : Final estimate: PPL = 3.8092 +/- 0.03584
235b IQ4_XS : Final estimate: PPL = 3.7938 +/- 0.03551
Using calibration_data_v5_rc.txt
If someone has a base/dataset for the 235b I can compare kld.
@Panchovix nice looks like a lot of folks are interested in multi-gpu with big models like you've been testing, thanks for sharing and spreading the word with your updated commands!
@Lockout oh glad you got it running, I've been running a bunch of ppl/kld myself lately and hoping to release some data soon (maybe later today) if I can make some graphs.
I've been using wiki.test.raw
(wiki.test.raw.gz, make sure to gunzip it) for the perplexity test, and my own ubergarm-kld-test-corpus.txt
(hopefully novel [never been trained on] data I got using whisper transcripts from a podcast; I've described it elsewhere better).
I think it's best to run the tests against a dataset different from whatever people are using for imatrix calibration.
And right, for the baseline 235B first pass I had to run the Q8_0
as I didn't have enough RAM+VRAM. It's a big file.
I have some limited numbers from earlier and there is good discussion here on the challenges given PPL is lower than BF16 for the 30B for some quants: https://github.com/ikawrakow/ik_llama.cpp/discussions/359
It's someone else's calibration dataset, not the one you or unsloth used. I will see what happens on wiki too. I've got 384GB of RAM, but my internet is way too slow to try making my own quants. It takes overnight and into the next day to even download these. DeepSeek will probably take me 2 days in Q2 form.
What a difference only a few tiny layers make?!
A22B-GGUF/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf
-t 28
-c 32768
--host 192.168.1.211
--numa distribute
-ngl 94
-ctk q8_0
-ctv q8_0
-fa
-rtr
-fmoe
-amb 512
-ub 1024
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|).ffn.=CUDA0"
-ot "blk.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33).ffn.=CUDA1"
-ot "blk.(34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.=CUDA2"
-ot "blk.(51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67).ffn.=CUDA3"
-ot "ffn.*=CPU"
llm_load_tensors: CPU buffer size = 31876.41 MiB
llm_load_tensors: CUDA_Host buffer size = 820.71 MiB
llm_load_tensors: CUDA0 buffer size = 21717.16 MiB
llm_load_tensors: CUDA1 buffer size = 21680.71 MiB
llm_load_tensors: CUDA2 buffer size = 21717.16 MiB
llm_load_tensors: CUDA3 buffer size = 21680.71 MiB
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 9.337 | 109.68 | 20.979 | 12.20 |
1024 | 256 | 1024 | 9.049 | 113.16 | 17.932 | 14.28 |
1024 | 256 | 2048 | 8.915 | 114.87 | 17.710 | 14.45 |
1024 | 256 | 3072 | 9.015 | 113.59 | 17.950 | 14.26 |
1024 | 256 | 4096 | 9.130 | 112.16 | 18.154 | 14.10 |
1024 | 256 | 5120 | 9.124 | 112.23 | 18.203 | 14.06 |
1024 | 256 | 6144 | 9.217 | 111.10 | 19.760 | 12.96 |
1024 | 256 | 7168 | 9.202 | 111.28 | 18.715 | 13.68 |
1024 | 256 | 8192 | 9.548 | 107.24 | 19.221 | 13.32 |
1024 | 256 | 9216 | 9.303 | 110.07 | 19.298 | 13.27 |
1024 | 256 | 10240 | 9.411 | 108.81 | 19.781 | 12.94 |
1024 | 256 | 11264 | 9.335 | 109.70 | 19.705 | 12.99 |
1024 | 256 | 12288 | 9.496 | 107.83 | 20.257 | 12.64 |
1024 | 256 | 13312 | 9.540 | 107.34 | 20.536 | 12.47 |
1024 | 256 | 14336 | 9.619 | 106.46 | 20.685 | 12.38 |
1024 | 256 | 15360 | 9.578 | 106.91 | 21.045 | 12.16 |
1024 | 256 | 16384 | 9.622 | 106.42 | 20.749 | 12.34 |
A22B-GGUF/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf
-t 28
-c 32768
--host 192.168.1.211
--numa distribute
-ngl 94
-ctk q8_0
-ctv q8_0
-fa
-rtr
-fmoe
-amb 512
-ub 1024
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|).ffn_.exps.=CUDA0"
-ot "blk.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33).ffn.exps.=CUDA1"
-ot "blk.(34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.exps.=CUDA2"
-ot "blk.(51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67).ffn._exps.=CUDA3"
-ot "ffn.*=CPU"
llm_load_tensors: CPU buffer size = 32013.47 MiB
llm_load_tensors: CUDA_Host buffer size = 820.71 MiB
llm_load_tensors: CUDA0 buffer size = 21682.90 MiB
llm_load_tensors: CUDA1 buffer size = 21646.44 MiB
llm_load_tensors: CUDA2 buffer size = 21682.90 MiB
llm_load_tensors: CUDA3 buffer size = 21646.44 MiB
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 9.878 | 103.67 | 23.039 | 11.11 |
1024 | 256 | 1024 | 9.441 | 108.46 | 21.745 | 11.77 |
1024 | 256 | 2048 | 9.364 | 109.35 | 20.607 | 12.42 |
1024 | 256 | 3072 | 9.379 | 109.18 | 20.445 | 12.52 |
1024 | 256 | 4096 | 9.486 | 107.95 | 20.648 | 12.40 |
1024 | 256 | 5120 | 9.407 | 108.86 | 20.830 | 12.29 |
1024 | 256 | 6144 | 9.543 | 107.30 | 21.139 | 12.11 |
1024 | 256 | 7168 | 9.497 | 107.82 | 20.938 | 12.23 |
1024 | 256 | 8192 | 9.578 | 106.91 | 21.761 | 11.76 |
1024 | 256 | 9216 | 9.574 | 106.96 | 21.873 | 11.70 |
1024 | 256 | 10240 | 9.668 | 105.91 | 21.942 | 11.67 |
1024 | 256 | 11264 | 9.780 | 104.70 | 22.522 | 11.37 |
1024 | 256 | 12288 | 9.762 | 104.90 | 22.656 | 11.30 |
1024 | 256 | 13312 | 9.809 | 104.39 | 23.003 | 11.13 |
1024 | 256 | 14336 | 9.890 | 103.54 | 22.788 | 11.23 |
1024 | 256 | 15360 | 9.953 | 102.89 | 23.373 | 10.95 |
1024 | 256 | 16384 | 9.883 | 103.61 | 23.347 | 10.96 |
Yet on IQ3 the reverse is true and it's much closer.