Q6_K

#1 by Autumnlight - opened

Hey, by any chance could you also create a Q6_K file with this format please?

I'm thinking about making two more, possibly: e.g. one that is a bit heavier and one that is a bit lighter. I've been testing the -mix-IQ3_K and it seems really good running locally, on par with DeepSeek-V3-0324 in my anecdotal opinion, though of course it is a reasoning model so it takes a bit longer.

The current -mix-IQ3_K also barely fits on my rig, so I have to close my Firefox browser to free up enough RAM, and I'm running a super lean Arch Linux + X11 + dwm tiling window manager + alacritty terminal setup. So having a leaner version could be handy, as most folks will likely have to run headless or set up a little swap space to hold their browser RAM haha...

Any specific VRAM+RAM breakpoints you're working with regarding a possible IQ6_K version? I'd probably go full Q8_0 for all attention layers as they are pretty small, then do IQ6_K/IQ5_K for gate/(up|down) or similar...
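Roughly speaking, that kind of mix could be expressed with ik_llama.cpp's custom quantization rules in llama-quantize. The regexes and filenames below are purely illustrative and the --custom-q syntax is from memory, so treat this as a sketch rather than an actual recipe:

./build/bin/llama-quantize \
    --imatrix imatrix-Qwen3-235B-A22B.dat \
    --custom-q "blk\..*\.attn_.*=q8_0,blk\..*\.ffn_down_exps.*=iq6_k,blk\..*\.ffn_(gate|up)_exps.*=iq5_k" \
    Qwen3-235B-A22B-BF16.gguf \
    Qwen3-235B-A22B-mix-IQ6_K.gguf \
    IQ6_K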

Is it worth moving to this from IQ4_XS? Those extra flags like -fmoe and -rtr have gained me 87.79 t/s PP and 9.97 t/s TG. In regular llama.cpp I only get half of that, and I was going to get a smaller unsloth quant but they keep updating it and breaking my downloads.

How fast do your small DeepSeek quants run compared to Qwen? I know jack about offloading compared to just doing GPU inference, so much to learn.

@Lockout

Is it worth moving to this from IQ4_XS?

If you can fit this -mix-IQ3_K in your rig, I can definitely recommend it over the unsloth UD-Q3_K_XL, which is in a similar size class. I don't have numbers on the unsloth IQ4_XS, but I am working on more benchmarks now and hope to do a post on r/LocalLLaMA soon :tm:.

Here is a sneak peek of what I have already:

[image: qwen3-235b-fig-04.png]

Interestingly, bartowski's Qwen3-30B-A3B ~4bpw quants are looking very competitive; I hope to work more on that model soon :tm: too!

How fast do your small DeepSeek quants run compared to Qwen?

In my limited testing on my local rig, I'd choose my Qwen3-235B-A22B-mix-IQ3_K every time now over my DeepSeek-V3-0324-IQ2_K_R4 or unreleased DeepSeek-R1-GGUF-Q2_K_R4. Qwen3 is much faster, and in limited testing the quality feels better than V3-0324 at least, and on par with or possibly better than the smaller R1 quants, at least for coding-type tasks.

Here is the speed graph running my Qwen3-235B-A22B-mix-IQ3_K locally on a 3090TI FE 24GB VRAM + AMD 9950X 2x48GB DDR5-6400 rig.

[image: qwen3-moe-troll-rig.png]

Also, ik may be working on more improvements to the GQA FA implementation on his fork, which could possibly improve speed even more for Qwen3 and similar models.

Seeing that KLD, I'm glad I didn't waste time with the other quant. DeepSeek is better for creative tasks, unfortunately. I saw surprisingly decent speeds in the ik_llama discussions; my assumption would have been 2-3 t/s at best without fancy new-generation Xeons.

The IQ4 gives me similar outputs to the API; if this does too and generates faster, it would be a win.

Sorry to chime in here, but any possibility of a Q4_K? I could fit it into my PC using ~20GB of RAM.

Now I'm downloading this one, and it should fit fully in VRAM in my case. Would there be any issues using only CUDA?

EDIT: Tested on full CUDA and working fine! Pretty nice results while testing the model.

@Lockout keep us posted, I'd love to hear if this meets your quality expectations! I've been impressed with it so far.

@Panchovix oh hey, you have all the GPUs, yes! Correct, I thought of you while making this model and did not repack the quants to _R4 myself, to allow a wider variety of VRAM+RAM combinations to work out of the box. If someone wants to run on RAM they can use -rtr or the offline repack tool themselves easily enough without downloading anything more.

If anyone is interested, I have some limited benchmarks for speed and quality on my fresh new Qwen3-30B-A3B-mix-IQ4_K.gguf and hit over 1600 tok/sec PP and 105 tok/sec TG peak on my 3090 TI FE 24GB VRAM!

Heh, it's almost done downloading even, an hour left. I did not find any benefit with ik_llama for full GPU inference; it was slower than mainline. Maybe it's different if you pass -fmoe and some of the other flags, but dense models lost t/s. Come to think of it.. I can no-shit fully offload this quant too. If I recruit my 5th GPU I'll have 118GB. I could also install 1 or 2 24GB P40s or a P100.. power consumption not worth it though.

Also getting curious about THP and whether that will help. It says to run it without mmap, so -rtr is fine.. but do I then turn off -rtr to "benefit"? From the issue it says to clear caches when switching so the weights will load from HDD once again. I have a dual socket box with 1 NUMA node per socket. Maybe Q3/Q4 of Qwen is too small to need any of that?
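For anyone else poking at this, these are the knobs being discussed (standard Linux sysfs/procfs paths; the values are just examples, not a recommendation):

# check / set transparent huge pages (THP)
cat /sys/kernel/mm/transparent_hugepage/enabled
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled   # or "madvise"

# drop the page cache between runs so the weights really reload from disk
# when switching between mmap-style loading and --no-mmap/-rtr
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches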

On 128GB VRAM (5090+4090x2+A6000) but slow PCIe, I get lower speeds vs main yes but it's still pretty fast.

At X8/X8/X4/X4, I get 500 t/s pp and 23 t/s while generating (iq3 ikllamacpp)

On main I get same pp but 28t/s while generating (ud q3_k_xl)

On ud q4_k_xl with CPU offloading (20GB ram or so) I get 300 pp and 20 t/s while generating on main llamacpp.

With this quant I see some 11.x output token speeds, so it's slightly faster.

I run it like this with 32k:

--numa distribute \
-ngl 94 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-rtr \
-fmoe \
-ub 1024 \
-amb 1024 \
-ot "(1[0-9]).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU" \

The ubatch increases PP speed at the cost of t/s and makes bigger buffers on CUDA. I don't know if -amb 1024 vs 512 makes a difference.

PP is now over 100.

ok.. update as to quality....

So here I am torn: the model is more likely to not know what mesugaki means in IQ3, but reloading the IQ4 has caused it to screw up often as well. Had this strange issue where I got better outputs when I offloaded more to CPU, and repetition got to be less as well. I was testing CPU-only inference with CUDA PP. Now that I run it more, the IQ4 and IQ3 are both printing fairly similar t/s too.

Thanks for all the testing and results all!

@Lockout
I just saw that @ArtusDev released what looks like some kind of IQ6_K version here: https://huggingface.co/ArtusDev/Qwen3-235B-A22B-GGUF (but I have not tested it myself). Looking at the model card sidebar, it suggests they are using Q8_0 for ffn_down_exps, which is a bit surprising to me for an IQ6_K. I don't see the exact recipe they used, however, and unfortunately huggingface doesn't recognize iqN_k quants properly in the gguf dump sidebar... Theirs is almost double the size of this one, maybe enough bits to "know what mesugaki means"? haha

TIL: I just looked it up myself....
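If you want to check a recipe yourself rather than trusting the sidebar, dumping the tensor list works; something like this (using the dump script that ships with the gguf python package, name from memory, and a placeholder filename):

pip install gguf
gguf-dump /models/some-model-00001-of-00002.gguf | grep -E "ffn_(down|gate|up)_exps"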

@Lockout @Panchovix

I was testing some offload commands way too late into the morning with some folks on a Discord called "beaver ai club", possibly run by "thedrummer" (I think lol). Anyway, 59smoke, Aes Sedai, and I worked out a better starting place for multi-GPU tensor-override for this model, e.g.

./build/bin/llama-server \
    --model /mnt/models/ubergarm/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
    --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
    -fa \
    -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    -fmoe \
    -amb 512 \
    -rtr \
    -ngl 99 \
    -ts 24,24 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn.*=CUDA0" \
    -ot "blk\.(14|15|16|17|18|19|20|21|22|23|24|25|26)\.ffn.*=CUDA1" \
    -ot "ffn.*=CPU" \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080

We saw a lot of speed-ups by removing that *_exps part and matching all ffn layers; otherwise some of the ffn tensors would be on GPU and some on CPU, which was hurting performance by about 50% it seemed.
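To spell out the difference between the two patterns (the layer list here is just illustrative, taken from the command above):

# matches ALL ffn tensors of blocks 0-12 (norms, gate/up/down, routed experts), so each block's ffn stays together on CUDA0
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn.*=CUDA0"

# matches ONLY the routed-expert tensors of those blocks; the rest of the ffn lands wherever -ngl/-ts decide
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn_.*_exps\.=CUDA0"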

Seems like a 24GB GPU can hold maybe ~12-14ish layers or so depending on context and kv cache quantization etc...

EDIT: Also for multi GPU setups look into compiling with -DGGML_SCHED_MAX_COPIES=1 (default is 4) which may free up some VRAM etc.
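Something like this, assuming the usual CUDA cmake build (adjust to whatever flags you normally build with):

cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)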

I gotta update the model card once this gets ironed out a bit better. Thanks!

Oh interesting, gonna try! By the way I wonder, would normal llamacpp quants work here as well on ikllamacpp? Like unsloth UD Q6_K.

I noticed a weird behaviour with offloading experts to CUDA, as you have noticed on these models: setting a range for some reason doesn't work as expected. It didn't occur to me to list each layer specifically, so gonna try that instead. E.g. I set 1 (one!) expert to a 5090 and it uses 26GB VRAM on Q4_K_XL.

Oh and sorry to bump here, but any chance to add nemotron 253b support on ikllamacpp? From this PR https://github.com/ggml-org/llama.cpp/pull/12843

Oh interesting, gonna try! By the way I wonder, would normal llamacpp quants work here as well on ikllamacpp? Like unsloth UD Q6_K.

Yes, you can run all normal mainline quantizations with ik_llama.cpp and take advantage of the -rtr and other optimizations as well without specifically needing the iqN_k quants. Check out these sweet benchmarks and commands by @AesSedai comparing some runs with that model:

https://github.com/ikawrakow/ik_llama.cpp/discussions/357#discussioncomment-13020187
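As a minimal sketch of what that looks like with a mainline quant (the filename and paths here are hypothetical, flags as used elsewhere in this thread):

./build/bin/llama-server \
    --model /models/Qwen3-235B-A22B-UD-Q6_K_XL-00001-of-00004.gguf \
    -fa -fmoe -rtr \
    -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    -ngl 99 \
    -ot "ffn.*=CPU" \
    --threads 16 --host 127.0.0.1 --port 8080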

I noticed a weird behaviour with offloading experts to CUDA, as you have noticed on these models: setting a range for some reason doesn't work as expected. It didn't occur to me to list each layer specifically, so gonna try that instead. E.g. I set 1 (one!) expert to a 5090 and it uses 26GB VRAM on Q4_K_XL.

Right, I haven't figured out exactly what is going on, but there seems to be something about mixing -ts and -ot on multi-GPU setups that makes it difficult to predict how things will get allocated, often leading to uneven distribution, OOMs, and such. So we started being more explicit about using -ot for more layers, and it seemed to run more balanced across multiple GPUs and show faster performance, more in line with expectations.

I don't have 2x GPUs on my home rig to easily try and my remote rig is busy with some benchmarks for now.

Oh and sorry to bump here, but any chance to add nemotron 253b support on ikllamacpp? From this PR https://github.com/ggml-org/llama.cpp/pull/12843

You're a demanding customer xD loljk. I'd suggest opening an issue on ik_llama.cpp and looking at recent PRs like the one I did for GLM-4 to see how to adapt model loading, CUDA graph building, and other misc bits for adding a new architecture. Given that Nemotron 253B looks like a dense model, personally I'd hold off just a little bit until ik reworks attention, which is going on here now: https://github.com/ikawrakow/ik_llama.cpp/pull/370

After that settles down a bit, then might be a good time to add more arch's, just my two cents.

Cheers!

Haha many thanks for all the info!

Okay, just wanted to update: the way to load the experts with each layer number listed is amazing! I could load the Q6_K on normal llama.cpp with good usage with

./llama-server -m '/home/llm/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf' \
    -c 32768 --no-mmap --no-warmup -v -ngl 999 -fa \
    -ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn.*=CUDA0" \
    -ot "blk\.(9|10|11|12|13|14|15|16|17)\.ffn.*=CUDA1" \
    -ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30)\.ffn.*=CUDA2" \
    -ot "blk\.(31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52)\.ffn.*=CUDA3" \
    -ot "ffn.*=CPU"

Which is about 21GB on each 4090, 30GB on the 5090 and 44GB on the A6000.

And got these speeds

prompt eval time =   57152.69 ms /  3877 tokens (   14.74 ms per token,    67.84 tokens per second)
       eval time =   38705.90 ms /   318 tokens (  121.72 ms per token,     8.22 tokens per second)

Not the best, not the worst, but pretty usable for coding! Now I have to test ikllamacpp to see if I get better speeds when offloading, but man separating the experts by each number is magic.

Okay, ran ik_llama.cpp with a similar command but added -fmoe, -amb 512 and -rtr, so it looks like this

./llama-server -m '/home/llm/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf' \
    -c 16384 --no-mmap --no-warmup -v -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn.*=CUDA0" \
    -ot "blk\.(9|10|11|12|13|14|15|16|17)\.ffn.*=CUDA1" \
    -ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30)\.ffn.*=CUDA2" \
    -ot "blk\.(31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52)\.ffn.*=CUDA3" \
    -ot "ffn.*=CPU" \
    -fmoe -amb 512 -rtr

And got a huge jump in PP performance

INFO [           print_timings] prompt eval time     =   39663.05 ms /  3877 tokens (   10.23 ms per token,    97.75 tokens per second) | tid="140196332539904" timestamp=1746306311 id_slot=0 id_task=0 t_prompt_processing=39663.052 n_prompt_tokens_processed=3877 t_token=10.230346143925717 n_tokens_second=97.74840322424002
INFO [           print_timings] generation eval time =  102110.03 ms /   825 runs   (  123.77 ms per token,     8.08 tokens per second) | tid="140196332539904" timestamp=1746306311 id_slot=0 id_task=0 t_token_generation=102110.027 n_decoded=825 t_token=123.7697296969697 n_tokens_second=8.079519947634525

Basically 47-50% faster!

EDIT: Try 2

INFO [           print_timings] prompt eval time     =   36897.66 ms /  3877 tokens (    9.52 ms per token,   105.07 tokens per second) | tid="140095757803520" timestamp=1746307138 id_slot=0 id_task=0 t_prompt_processing=36897.659 n_prompt_tokens_processed=3877 t_token=9.517064482847562 n_tokens_second=105.07441678075024
INFO [           print_timings] generation eval time =  143560.31 ms /  1197 runs   (  119.93 ms per token,     8.34 tokens per second) | tid="140095757803520" timestamp=1746307138 id_slot=0 id_task=0 t_token_generation=143560.31 n_decoded=1197 t_token=119.93342522974102 n_tokens_second=8.337959147622348

Pretty nice.

I've been using llama-sweep-bench and found GGML_SCHED_MAX_COPIES=1 doesn't really help. You get higher t/s but slower PP. Many other params have this tradeoff. Better results with AMB of 512 than 1024. -ub 1024 worked best.

Will try to do just complete FFN layers. Way to make me re-test again.
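(The tables below are llama-sweep-bench output; a run roughly like the following produces them, with the -ot overrides swapped per test, so treat the exact flags as a sketch.)

./build/bin/llama-sweep-bench \
    --model /mnt/models/ubergarm/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
    -c 32768 -ub 1024 \
    -fa -fmoe -rtr -amb 512 \
    -ctk q8_0 -ctv q8_0 \
    -ngl 94 --threads 28 \
    -ot "..."   # tensor overrides under test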

iq3 with MAX_COPIES=1

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 95, n_threads = 28, n_threads_batch = 28

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 11.519 88.90 17.821 14.37
1024 256 1024 10.921 93.76 17.361 14.75
1024 256 2048 11.279 90.79 18.501 13.84
1024 256 3072 11.268 90.87 18.989 13.48
1024 256 4096 11.124 92.05 19.846 12.90
1024 256 5120 11.005 93.05 20.465 12.51
1024 256 6144 11.266 90.89 21.227 12.06
1024 256 7168 11.235 91.14 22.048 11.61
1024 256 8192 11.551 88.65 22.873 11.19
1024 256 9216 11.534 88.78 23.690 10.81
1024 256 10240 11.545 88.70 24.413 10.49
1024 256 11264 11.522 88.87 25.077 10.21
1024 256 12288 11.607 88.23 26.143 9.79
1024 256 13312 11.703 87.50 26.454 9.68
1024 256 14336 11.880 86.20 27.781 9.22
1024 256 15360 11.804 86.75 27.923 9.17
1024 256 16384 11.926 85.86 29.411 8.70

vs

-ot "\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU" 


main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.406 108.87 19.733 12.97
1024 256 1024 9.044 113.23 19.282 13.28
1024 256 2048 9.103 112.49 19.568 13.08
1024 256 3072 9.230 110.94 20.020 12.79
1024 256 4096 9.330 109.75 21.192 12.08
1024 256 5120 9.285 110.28 21.932 11.67
1024 256 6144 9.331 109.75 22.542 11.36
1024 256 7168 9.635 106.28 23.735 10.79
1024 256 8192 9.540 107.33 24.221 10.57
1024 256 9216 9.896 103.48 25.540 10.02
1024 256 10240 9.931 103.12 25.744 9.94
1024 256 11264 9.852 103.94 27.056 9.46
1024 256 12288 9.959 102.82 27.363 9.36
1024 256 13312 9.900 103.43 28.057 9.12
1024 256 14336 10.082 101.57 28.988 8.83
1024 256 15360 10.252 99.88 29.665 8.63
1024 256 16384 10.381 98.64 30.715 8.33
1024 256 17408 10.377 98.68 31.747 8.06
1024 256 18432 10.496 97.56 32.407 7.90
1024 256 19456 10.405 98.42 33.066 7.74
1024 256 20480 10.678 95.90 34.071 7.51
1024 256 21504 10.622 96.40 34.884 7.34
1024 256 22528 10.793 94.88 35.753 7.16
1024 256 23552 10.855 94.34 36.423 7.03
1024 256 24576 11.138 91.94 37.135 6.89
1024 256 25600 11.020 92.92 37.695 6.79
1024 256 26624 11.241 91.09 38.460 6.66
1024 256 27648 11.156 91.79 39.634 6.46
1024 256 28672 11.297 90.64 40.637 6.30
1024 256 29696 11.609 88.21 41.458 6.17
1024 256 30720 11.420 89.66 41.816 6.12
1024 256 31744 11.560 88.58 42.828 5.98
-ot "(1[0-9]|39).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU"

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.455 108.30 20.046 12.77
1024 256 1024 9.044 113.23 19.252 13.30
1024 256 2048 9.134 112.11 19.727 12.98
1024 256 3072 9.173 111.63 20.501 12.49
1024 256 4096 9.157 111.82 21.064 12.15
1024 256 5120 9.322 109.85 22.093 11.59
1024 256 6144 9.289 110.24 22.626 11.31
1024 256 7168 9.510 107.67 23.796 10.76
1024 256 8192 9.641 106.21 24.726 10.35
1024 256 9216 9.674 105.85 25.821 9.91
1024 256 10240 9.857 103.88 26.529 9.65
1024 256 11264 9.906 103.37 27.412 9.34
1024 256 12288 10.087 101.52 28.002 9.14
1024 256 13312 9.963 102.78 28.809 8.89
1024 256 14336 10.214 100.25 29.980 8.54
1024 256 15360 10.263 99.78 30.997 8.26
1024 256 16384 10.286 99.56 31.577 8.11
1024 256 17408 10.511 97.42 32.338 7.92
1024 256 18432 10.451 97.98 32.650 7.84
1024 256 19456 10.491 97.61 33.754 7.58
1024 256 20480 10.703 95.67 33.956 7.54
1024 256 21504 10.707 95.64 34.782 7.36
1024 256 22528 10.773 95.05 35.988 7.11
1024 256 23552 10.946 93.55 36.824 6.95
1024 256 24576 11.020 92.92 37.100 6.90
1024 256 25600 10.987 93.20 38.272 6.69
1024 256 26624 11.166 91.71 39.116 6.54
1024 256 27648 11.420 89.67 40.111 6.38
1024 256 28672 11.370 90.06 41.202 6.21
1024 256 29696 11.510 88.97 41.707 6.14
1024 256 30720 11.573 88.48 42.415 6.04
1024 256 31744 11.530 88.82 42.722 5.99

Also.. for some reason I have to set -ngl lower.. like 93/94 instead of 95. Otherwise it doesn't fill the GPUs but tries to allocate massive buffers while taking far fewer layers. It asked for 9GB+ even, and it's not KV or the compute buffer.

I got it to load at 95 and perf is much worse

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 21.511 47.60 31.077 8.24
1024 256 1024 21.014 48.73 30.145 8.49

vs

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 10.069 101.70 20.471 12.51
1024 256 1024 9.933 103.09 19.225 13.32

Offloading sequential layers makes things more consistent but not necessarily faster.

@Lockout the reason GGML_SCHED_MAX_COPIES matters is that ik_llama.cpp will try to duplicate the VRAM assignment if there is more than one GPU involved:

# ref: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L20201-L20216

// enabling pipeline parallelism in the scheduler increases memory usage, so it is only done when necessary
bool pipeline_parallel =
    llama_get_device_count(*model) > 1 &&
    model->n_gpu_layers > (int)model->hparams.n_layer &&
    model->split_mode == LLAMA_SPLIT_MODE_LAYER &&
    params.offload_kqv;
#ifndef GGML_USE_CUDA
// pipeline parallelism requires support for async compute and events
// currently this is only implemented in the CUDA backend
pipeline_parallel = false;
#endif
ctx->sched = ggml_backend_sched_new(ctx->backends.data(), backend_buft.data(), ctx->backends.size(), max_nodes, pipeline_parallel);

if (pipeline_parallel) {
    LLAMA_LOG_INFO("%s: pipeline parallelism enabled (n_copies=%d)\n", __func__, ggml_backend_sched_get_n_copies(ctx->sched));
}

and at least for me, with the 235B-A22B, I'm already loading up the entirety of my VRAM so I don't have 3x as much VRAM to spare for the parallelism. It's less a speed thing and more of a "I want the model to load at all, please" thing.

@AesSedai is correct. I was trying to use the default, but then it tries to copy some buffers and I get OOM. Especially a big 10GB buffer on the A6000.

llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors:        CPU buffer size = 77572.64 MiB
llm_load_tensors:  CUDA_Host buffer size =   486.86 MiB
llm_load_tensors:      CUDA0 buffer size = 18032.50 MiB
llm_load_tensors:      CUDA1 buffer size = 18032.50 MiB
llm_load_tensors:      CUDA2 buffer size = 25879.55 MiB
llm_load_tensors:      CUDA3 buffer size = 44064.14 MiB
...
llama_kv_cache_init:      CUDA0 KV buffer size =  1152.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =  1152.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =  1472.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =  2240.00 MiB
llama_new_context_with_model: KV self size  = 6016.00 MiB, K (f16): 3008.00 MiB, V (f16): 3008.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 2386.63 MiB
ggml_gallocr_reserve_n: reallocating CUDA1 buffer from size 0.00 MiB to 1056.51 MiB
ggml_gallocr_reserve_n: reallocating CUDA2 buffer from size 0.00 MiB to 2732.50 MiB
ggml_gallocr_reserve_n: reallocating CUDA3 buffer from size 0.00 MiB to 10432.71 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 10432.71 MiB on device 3: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA3 buffer of size 10939492352
llama_new_context_with_model: failed to allocate compute buffers

So that's the mystery. Turning off the copies must be what -ngl 94 is accomplishing: per the snippet above, pipeline parallelism only kicks in when n_gpu_layers exceeds the model's n_layer, so leaving one layer off the GPUs keeps n_copies at 1.

llm_load_tensors: offloaded 94/95 layers to GPU
llm_load_tensors: CPU buffer size = 19656.00 MiB
llm_load_tensors: CUDA_Host buffer size = 1261.20 MiB
llm_load_tensors: CUDA0 buffer size = 22148.27 MiB
llm_load_tensors: CUDA1 buffer size = 22089.93 MiB
llm_load_tensors: CUDA2 buffer size = 22148.27 MiB
llm_load_tensors: CUDA3 buffer size = 22089.93 MiB
....................................................................................................
============ Repacked 55 tensors
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 1024
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 782.01 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 782.01 MiB
llama_new_context_with_model: KV self size = 3196.00 MiB, K (q8_0): 1598.00 MiB, V (q8_0): 1598.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 416.00 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 273.50 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 273.50 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 273.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 609.50 MiB

btw.. holy crap.. pull the new FA fixes.

This is such a beautiful discussion lmao, <3 y'all! I'll send folks over here as they embark on their multi-GPU tensor offload journey! haha

Speaking of discussions.. has anyone tried https://huggingface.co/MikeRoz/Qwen3-235B-A22B-exl2 and how does it compare?

new FA IQ3_K

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.722 105.33 20.006 12.80
1024 256 1024 9.149 111.92 19.087 13.41
1024 256 2048 9.280 110.34 18.442 13.88
1024 256 3072 9.148 111.94 18.475 13.86
----snip
1024 256 28672 10.278 99.63 24.305 10.53
1024 256 29696 10.497 97.55 24.513 10.44
1024 256 30720 10.362 98.83 24.780 10.33
1024 256 31744 10.314 99.28 25.245 10.14

new FA IQ4_XS

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 10.544 97.12 22.129 11.57
1024 256 1024 9.993 102.48 19.998 12.80
1024 256 2048 10.041 101.98 19.028 13.45
1024 256 3072 9.788 104.62 18.866 13.57
----snip
1024 256 28672 11.045 92.71 25.269 10.13
1024 256 29696 11.060 92.58 25.101 10.20
1024 256 30720 11.059 92.60 25.289 10.12
1024 256 31744 11.196 91.46 26.192 9.77

How to test KLD/perplexity painlessly? Q6_K is probably too big of a speed drop, but these two are almost identical. Mainline llama.cpp doesn't even have sweep-bench to compare speeds.

@Lockout @ubergarm has most of the sweep-bench implementation for llama.cpp on this branch: https://github.com/ggml-org/llama.cpp/compare/master...ubergarm:llama.cpp:ug/port-sweep-bench

You can pull / cherry-pick those and recompile llama.cpp to get sweep-bench there too.
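One way to grab it (branch name taken from the compare URL above; cherry-picking the commits onto your own checkout works too):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git remote add ubergarm https://github.com/ubergarm/llama.cpp
git fetch ubergarm
git checkout -b sweep-bench ubergarm/ug/port-sweep-bench
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j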

@Lockout

How to test KLD/perplexity painlessly?

I have slowly collected some PPL and KLD numbers here: https://github.com/ikawrakow/ik_llama.cpp/discussions/359#discussioncomment-13009539

Perplexity is as easy as inferencing with the model. But KLD is tricky, as it makes a big file and you ideally want to get the baseline using the full BF16, which may not be easy as it is over 400GB.

Example perplexity run for full offload:

./build/bin/llama-perplexity \
    -m "$model" \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

Example KLD first pass to generate the KLD base data file from BF16 (or Q8_0 if that is the biggest you can fit). Using a smaller fully-offloaded example here; adjust with your exact arguments for a given bigger model etc:

model=/mnt/raid/models/Qwen/Qwen3-30B-A3B/Qwen3-30B-A3B-BF16-00001-of-00002.gguf
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-BF16-ubergarm-kld-test-corpus-base.dat \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

Example KLD second pass using data file from above to test KLD of smaller models vs baseline.

model=/mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-BF16-ubergarm-kld-test-corpus-base.dat \
    --kl-divergence \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

The KLD run will also give you PPL for another data point with a different corpus.

Just made a reddit post with some metrics if you guys are interested: https://www.reddit.com/r/LocalLLaMA/comments/1kezq68/speed_metrics_running_deepseekv3_0324qwen3_235b/

I posted the offloading speeds there!

Ok.. got perplexity working...

235b ubergarm iq3 : Final estimate: PPL = 3.8092 +/- 0.03584

235b IQ4_XS : Final estimate: PPL = 3.7938 +/- 0.03551

Using calibration_data_v5_rc.txt

If someone has a base/dataset for the 235b I can compare kld.

@Panchovix nice looks like a lot of folks are interested in multi-gpu with big models like you've been testing, thanks for sharing and spreading the word with your updated commands!

@Lockout oh glad you got it running, I've been running a bunch of ppl/kld myself lately and hoping to release some data soon (maybe later today) if I can make some graphs.

I've been using wiki.test.raw (from wiki.test.raw.gz, make sure to gunzip it) for the perplexity test and my own ubergarm-kld-test-corpus.txt (hopefully novel [never been trained on] data I got using whisper transcripts from a podcast; I've described it elsewhere better).

I think it's best to run the tests against a dataset different than whatever people are using for imatrix calibration.

And right, for the baseline 235B first pass I had to run the Q8_0 as I didn't have enough RAM+VRAM. It's a big file.

I have some limited numbers from earlier and there is good discussion here on the challenges given PPL is lower than BF16 for the 30B for some quants: https://github.com/ikawrakow/ik_llama.cpp/discussions/359

It's someone else's calibration dataset, not the one you or unsloth used. I will see what happens on wiki too. I've got 384GB of RAM, but my internet is way too slow to try making my own quants; it takes overnight and into the next day just to get these. DeepSeek will probably take me 2 days in Q2 form.

What a difference only a few tiny layers make?!

A22B-GGUF/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf
-t 28
-c 32768
--host 192.168.1.211
--numa distribute
-ngl 94
-ctk q8_0
-ctv q8_0
-fa
-rtr
-fmoe
-amb 512
-ub 1024
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|).ffn.=CUDA0"
-ot "blk.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33).ffn.
=CUDA1"
-ot "blk.(34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.=CUDA2"
-ot "blk.(51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67).ffn.
=CUDA3"
-ot "ffn.*=CPU"

llm_load_tensors: CPU buffer size = 31876.41 MiB
llm_load_tensors: CUDA_Host buffer size = 820.71 MiB
llm_load_tensors: CUDA0 buffer size = 21717.16 MiB
llm_load_tensors: CUDA1 buffer size = 21680.71 MiB
llm_load_tensors: CUDA2 buffer size = 21717.16 MiB
llm_load_tensors: CUDA3 buffer size = 21680.71 MiB

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.337 109.68 20.979 12.20
1024 256 1024 9.049 113.16 17.932 14.28
1024 256 2048 8.915 114.87 17.710 14.45
1024 256 3072 9.015 113.59 17.950 14.26
1024 256 4096 9.130 112.16 18.154 14.10
1024 256 5120 9.124 112.23 18.203 14.06
1024 256 6144 9.217 111.10 19.760 12.96
1024 256 7168 9.202 111.28 18.715 13.68
1024 256 8192 9.548 107.24 19.221 13.32
1024 256 9216 9.303 110.07 19.298 13.27
1024 256 10240 9.411 108.81 19.781 12.94
1024 256 11264 9.335 109.70 19.705 12.99
1024 256 12288 9.496 107.83 20.257 12.64
1024 256 13312 9.540 107.34 20.536 12.47
1024 256 14336 9.619 106.46 20.685 12.38
1024 256 15360 9.578 106.91 21.045 12.16
1024 256 16384 9.622 106.42 20.749 12.34

A22B-GGUF/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf
-t 28
-c 32768
--host 192.168.1.211
--numa distribute
-ngl 94
-ctk q8_0
-ctv q8_0
-fa
-rtr
-fmoe
-amb 512
-ub 1024
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|).ffn_.exps.=CUDA0"
-ot "blk.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33).ffn
.
exps.=CUDA1"
-ot "blk.(34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn
.exps.=CUDA2"
-ot "blk.(51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67).ffn
.
_exps.=CUDA3"
-ot "ffn.*=CPU"

llm_load_tensors: CPU buffer size = 32013.47 MiB
llm_load_tensors: CUDA_Host buffer size = 820.71 MiB
llm_load_tensors: CUDA0 buffer size = 21682.90 MiB
llm_load_tensors: CUDA1 buffer size = 21646.44 MiB
llm_load_tensors: CUDA2 buffer size = 21682.90 MiB
llm_load_tensors: CUDA3 buffer size = 21646.44 MiB

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.878 103.67 23.039 11.11
1024 256 1024 9.441 108.46 21.745 11.77
1024 256 2048 9.364 109.35 20.607 12.42
1024 256 3072 9.379 109.18 20.445 12.52
1024 256 4096 9.486 107.95 20.648 12.40
1024 256 5120 9.407 108.86 20.830 12.29
1024 256 6144 9.543 107.30 21.139 12.11
1024 256 7168 9.497 107.82 20.938 12.23
1024 256 8192 9.578 106.91 21.761 11.76
1024 256 9216 9.574 106.96 21.873 11.70
1024 256 10240 9.668 105.91 21.942 11.67
1024 256 11264 9.780 104.70 22.522 11.37
1024 256 12288 9.762 104.90 22.656 11.30
1024 256 13312 9.809 104.39 23.003 11.13
1024 256 14336 9.890 103.54 22.788 11.23
1024 256 15360 9.953 102.89 23.373 10.95
1024 256 16384 9.883 103.61 23.347 10.96

Yet on IQ3 the reverse is true and it's much closer.
