Q6_K

#1 by Autumnlight - opened

Hey, by any chance could you also create a Q6_K file with this format please?

I'm thinking about making two more, possibly: e.g. one that is a bit heavier and one that is a bit lighter. I've been testing the -mix-IQ3_K and it seems really good running locally, on par with DeepSeek-V3-0324 in my anecdotal opinion, though of course it is a reasoning model so it takes a bit longer.

The current -mix-IQ3_K also barely fits on my rig, so I have to close my Firefox browser to free up enough RAM, and I'm running a super lean Arch Linux + X11 + dwm tiling window manager + alacritty terminal setup. So having a leaner version could be handy, as most folks will likely have to run headless or set up a little swap space to hold their browser RAM haha...

Any specific VRAM+RAM breakpoints you're working with regarding a possible IQ6_K version? I'd probably go full Q8_0 for all attention layers as they are pretty small, then do IQ6_K/IQ5_K for gate/(up|down) or similar...
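Roughly speaking, that kind of mix could be expressed with ik_llama.cpp's custom quantization rules in llama-quantize. The regexes and filenames below are purely illustrative and the --custom-q syntax is from memory, so treat this as a sketch rather than an actual recipe:

./build/bin/llama-quantize \
    --imatrix imatrix-Qwen3-235B-A22B.dat \
    --custom-q "blk\..*\.attn_.*=q8_0,blk\..*\.ffn_down_exps.*=iq6_k,blk\..*\.ffn_(gate|up)_exps.*=iq5_k" \
    Qwen3-235B-A22B-BF16.gguf \
    Qwen3-235B-A22B-mix-IQ6_K.gguf \
    IQ6_K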

Is it worth moving to this from IQ4_XS? Those extra flags like -fmoe and -rtr have gained me 87.79 t/s PP and 9.97 t/s TG. In regular llama.cpp I only get half of that, and I was going to get a smaller unsloth quant but they keep updating it and breaking my downloads.

How fast do your small DeepSeek quants run compared to Qwen? I know jack about offloading compared to just doing GPU inference, so much to learn.

@Lockout

Is it worth moving to this from IQ4_XS?

If you can fit this -mix-IQ3_K in your rig, I can definitely recommend it over the unsloth UD-Q3_K_XL, which is in a similar size class. I don't have numbers on the unsloth IQ4_XS, but I am working on more benchmarks now and hope to do a post on r/LocalLLaMA soon :tm:.

Here is a sneak peek of what I have already:

[image: qwen3-235b-fig-04.png]

Interestingly, bartowski's Qwen3-30B-A3B ~4bpw quants are looking very competitive; I hope to work more on that model soon :tm: too!

How fast do your small DeepSeek quants run compared to Qwen?

In my limited testing on my local rig, I'd choose my Qwen3-235B-A22B-mix-IQ3_K every time now over my DeepSeek-V3-0324-IQ2_K_R4 or unreleased DeepSeek-R1-GGUF-Q2_K_R4. Qwen3 is much faster, and in limited testing the quality feels better than V3-0324 at least, and on par with or possibly better than the smaller R1 quants, at least for coding-type tasks.

Here is the speed graph running my Qwen3-235B-A22B-mix-IQ3_K locally on a 3090TI FE 24GB VRAM + AMD 9950X 2x48GB DDR5-6400 rig.

[image: qwen3-moe-troll-rig.png]

Also, ik may be working on more improvements to the GQA FA implementation on his fork, which could possibly improve speed even more for Qwen3 and similar models.

Seeing that KLD, I'm glad I didn't waste time with the other quant. DeepSeek is better for creative tasks, unfortunately. I saw surprisingly decent speeds in the ik_llama discussions; my assumption would have been 2-3 t/s at best without fancy new-generation Xeons.

The IQ4 gives me similar outputs to the API; if this does too and generates faster, it would be a win.

Sorry to chime in here, but any possibility of a Q4_K? I could fit it into my PC using ~20GB of RAM.

Now I'm downloading this one, and it should fit fully in VRAM in my case. Would there be any issues using only CUDA?

EDIT: Tested on full CUDA and working fine! Pretty nice results while testing the model.

@Lockout keep us posted, I'd love to hear if this meets your quality expectations! I've been impressed with it so far.

@Panchovix oh hey, you have all the GPUs, yes! Correct, I thought of you while making this model and did not repack the quants to _R4 myself, to allow a wider variety of VRAM+RAM combinations to work out of the box. If someone wants to run on RAM they can use -rtr or the offline repack tool themselves easily enough without downloading anything more.

If anyone is interested, I have some limited benchmarks for speed and quality on my fresh new Qwen3-30B-A3B-mix-IQ4_K.gguf and hit over 1600 tok/sec PP and 105 tok/sec TG peak on my 3090 TI FE 24GB VRAM!

Heh, it's almost done downloading even, an hour left. I did not find any benefit with ik_llama for full GPU inference; it was slower than mainline. Maybe it's different if you pass -fmoe and some of the other flags, but dense models lost t/s. Come to think of it.. I can no-shit fully offload this quant too. If I recruit my 5th GPU I'll have 118GB. I could also install 1 or 2 24GB P40s or a P100.. power consumption not worth it though.

Also getting curious about THP and whether that will help. It says to run it without mmap, so -rtr is fine.. but do I then turn off -rtr to "benefit"? From the issue it says to clear caches when switching so the weights will load from HDD once again. I have a dual socket box with 1 NUMA node per socket. Maybe Q3/Q4 of Qwen is too small to need any of that?
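For anyone else poking at this, these are the knobs being discussed (standard Linux sysfs/procfs paths; the values are just examples, not a recommendation):

# check / set transparent huge pages (THP)
cat /sys/kernel/mm/transparent_hugepage/enabled
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled   # or "madvise"

# drop the page cache between runs so the weights really reload from disk
# when switching between mmap-style loading and --no-mmap/-rtr
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches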

On 128GB VRAM (5090+4090x2+A6000) but slow PCIe, I get lower speeds vs main yes but it's still pretty fast.

At X8/X8/X4/X4, I get 500 t/s pp and 23 t/s while generating (iq3 ikllamacpp)

On main I get same pp but 28t/s while generating (ud q3_k_xl)

On ud q4_k_xl with CPU offloading (20GB ram or so) I get 300 pp and 20 t/s while generating on main llamacpp.

With this quant I see some 11.x output token speeds, so it's slightly faster.

I run it like this with 32k:

--numa distribute \
-ngl 94 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-rtr \
-fmoe \
-ub 1024 \
-amb 1024 \
-ot "(1[0-9]).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU" \

The ubatch increases PP speed at the cost of t/s and makes bigger buffers on CUDA. I don't know if -amb 1024 vs 512 makes a difference.

PP is now over 100.

ok.. update as to quality....

So here I am torn: the model is more likely to not know what mesugaki means in IQ3, but reloading the IQ4 has caused it to screw up often as well. Had this strange issue where I got better outputs when I offloaded more to CPU, and repetition got to be less as well. I was testing CPU-only inference with CUDA PP. Now that I run it more, the IQ4 and IQ3 are both printing fairly similar t/s too.

Thanks for all the testing and results all!

@Lockout
I just saw that @ArtusDev released what looks like some kind of IQ6_K version here: https://huggingface.co/ArtusDev/Qwen3-235B-A22B-GGUF (but I have not tested it myself). Looking at the model card sidebar, it suggests they are using Q8_0 for ffn_down_exps, which is a bit surprising to me for an IQ6_K. I don't see the exact recipe they used, however, and unfortunately huggingface doesn't recognize iqN_k quants properly in the gguf dump sidebar... Theirs is almost double the size of this one, maybe enough bits to "know what mesugaki means"? haha

TIL: I just looked it up myself....
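If you want to check a recipe yourself rather than trusting the sidebar, dumping the tensor list works; something like this (using the dump script that ships with the gguf python package, name from memory, and a placeholder filename):

pip install gguf
gguf-dump /models/some-model-00001-of-00002.gguf | grep -E "ffn_(down|gate|up)_exps"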

@Lockout @Panchovix

I was testing some offload commands way too late into the morning with some folks on a Discord called "beaver ai club", possibly run by "thedrummer" (I think lol). Anyway, 59smoke, Aes Sedai, and I worked out a better starting place for multi-GPU tensor-override for this model, e.g.

./build/bin/llama-server \
    --model /mnt/models/ubergarm/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
    --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
    -fa \
    -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    -fmoe \
    -amb 512 \
    -rtr \
    -ngl 99 \
    -ts 24,24 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn.*=CUDA0" \
    -ot "blk\.(14|15|16|17|18|19|20|21|22|23|24|25|26)\.ffn.*=CUDA1" \
    -ot "ffn.*=CPU" \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080

We saw a lot of speed-ups by removing that *_exps part and matching all ffn layers; otherwise some of the ffn tensors would be on GPU and some on CPU, which was hurting performance by about 50% it seemed.
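To spell out the difference between the two patterns (the layer list here is just illustrative, taken from the command above):

# matches ALL ffn tensors of blocks 0-12 (norms, gate/up/down, routed experts), so each block's ffn stays together on CUDA0
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn.*=CUDA0"

# matches ONLY the routed-expert tensors of those blocks; the rest of the ffn lands wherever -ngl/-ts decide
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn_.*_exps\.=CUDA0"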

Seems like a 24GB GPU can hold maybe ~12-14ish layers or so depending on context and kv cache quantization etc...

EDIT: Also for multi GPU setups look into compiling with -DGGML_SCHED_MAX_COPIES=1 (default is 4) which may free up some VRAM etc.
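Something like this, assuming the usual CUDA cmake build (adjust to whatever flags you normally build with):

cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)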

I gotta update the model card once this gets ironed out a bit better. Thanks!

Oh interesting, gonna try! By the way I wonder, would normal llamacpp quants work here as well on ikllamacpp? Like unsloth UD Q6_K.

I noticed a weird behaviour with offloading experts to CUDA, as you have noticed on these models: setting a range for some reason doesn't work as expected. It didn't occur to me to list each layer specifically, so gonna try that instead. E.g. I set 1 (one!) expert to a 5090 and it uses 26GB VRAM on Q4_K_XL.

Oh and sorry to bump here, but any chance to add nemotron 253b support on ikllamacpp? From this PR https://github.com/ggml-org/llama.cpp/pull/12843

Oh interesting, gonna try! By the way I wonder, would normal llamacpp quants work here as well on ikllamacpp? Like unsloth UD Q6_K.

Yes, you can run all normal mainline quantizations with ik_llama.cpp and take advantage of the -rtr and other optimizations as well without specifically needing the iqN_k quants. Check out these sweet benchmarks and commands by @AesSedai comparing some runs with that model:

https://github.com/ikawrakow/ik_llama.cpp/discussions/357#discussioncomment-13020187
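As a minimal sketch of what that looks like with a mainline quant (the filename and paths here are hypothetical, flags as used elsewhere in this thread):

./build/bin/llama-server \
    --model /models/Qwen3-235B-A22B-UD-Q6_K_XL-00001-of-00004.gguf \
    -fa -fmoe -rtr \
    -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    -ngl 99 \
    -ot "ffn.*=CPU" \
    --threads 16 --host 127.0.0.1 --port 8080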

I noticed a weird behaviour with offloading experts to CUDA, as you have noticed on these models: setting a range for some reason doesn't work as expected. It didn't occur to me to list each layer specifically, so gonna try that instead. E.g. I set 1 (one!) expert to a 5090 and it uses 26GB VRAM on Q4_K_XL.

Right, I haven't figured out exactly what is going on, but there seems to be something about mixing -ts and -ot on multi-GPU setups that makes it difficult to predict how things will get allocated, often leading to uneven distribution, OOMs, and such. So we started being more explicit about using -ot for more layers, and it seemed to run more balanced across multiple GPUs and show faster performance, more in line with expectations.

I don't have 2x GPUs on my home rig to easily try and my remote rig is busy with some benchmarks for now.

Oh and sorry to bump here, but any chance to add nemotron 253b support on ikllamacpp? From this PR https://github.com/ggml-org/llama.cpp/pull/12843

You're a demanding customer xD loljk. I'd suggest opening an issue on ik_llama.cpp and looking at recent PRs like the one I did for GLM-4 to see how to adapt model loading, CUDA graph building, and other misc bits for adding a new architecture. Given that Nemotron 253B looks like a dense model, personally I'd hold off just a little bit until ik reworks attention, which is going on here now: https://github.com/ikawrakow/ik_llama.cpp/pull/370

After that settles down a bit, then might be a good time to add more arch's, just my two cents.

Cheers!

Haha many thanks for all the info!

Okay, just wanted to update: the way to load the experts with each layer number listed is amazing! I could load the Q6_K on normal llama.cpp with good usage with

./llama-server -m '/home/llm/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf' \
    -c 32768 --no-mmap --no-warmup -v -ngl 999 -fa \
    -ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn.*=CUDA0" \
    -ot "blk\.(9|10|11|12|13|14|15|16|17)\.ffn.*=CUDA1" \
    -ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30)\.ffn.*=CUDA2" \
    -ot "blk\.(31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52)\.ffn.*=CUDA3" \
    -ot "ffn.*=CPU"

Which is about 21GB on each 4090, 30GB on the 5090 and 44GB on the A6000.

And got these speeds

prompt eval time =   57152.69 ms /  3877 tokens (   14.74 ms per token,    67.84 tokens per second)
       eval time =   38705.90 ms /   318 tokens (  121.72 ms per token,     8.22 tokens per second)

Not the best, not the worst, but pretty usable for coding! Now I have to test ikllamacpp to see if I get better speeds when offloading, but man separating the experts by each number is magic.

Okay, ran ik_llama.cpp with a similar command but added -fmoe, -amb 512 and -rtr, so it looks like this

./llama-server -m '/home/llm/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf' \
    -c 16384 --no-mmap --no-warmup -v -ngl 999 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8)\.ffn.*=CUDA0" \
    -ot "blk\.(9|10|11|12|13|14|15|16|17)\.ffn.*=CUDA1" \
    -ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30)\.ffn.*=CUDA2" \
    -ot "blk\.(31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52)\.ffn.*=CUDA3" \
    -ot "ffn.*=CPU" \
    -fmoe -amb 512 -rtr

And got a huge jump in PP performance

INFO [           print_timings] prompt eval time     =   39663.05 ms /  3877 tokens (   10.23 ms per token,    97.75 tokens per second) | tid="140196332539904" timestamp=1746306311 id_slot=0 id_task=0 t_prompt_processing=39663.052 n_prompt_tokens_processed=3877 t_token=10.230346143925717 n_tokens_second=97.74840322424002
INFO [           print_timings] generation eval time =  102110.03 ms /   825 runs   (  123.77 ms per token,     8.08 tokens per second) | tid="140196332539904" timestamp=1746306311 id_slot=0 id_task=0 t_token_generation=102110.027 n_decoded=825 t_token=123.7697296969697 n_tokens_second=8.079519947634525

Basically 47-50% faster!

EDIT: Try 2

INFO [           print_timings] prompt eval time     =   36897.66 ms /  3877 tokens (    9.52 ms per token,   105.07 tokens per second) | tid="140095757803520" timestamp=1746307138 id_slot=0 id_task=0 t_prompt_processing=36897.659 n_prompt_tokens_processed=3877 t_token=9.517064482847562 n_tokens_second=105.07441678075024
INFO [           print_timings] generation eval time =  143560.31 ms /  1197 runs   (  119.93 ms per token,     8.34 tokens per second) | tid="140095757803520" timestamp=1746307138 id_slot=0 id_task=0 t_token_generation=143560.31 n_decoded=1197 t_token=119.93342522974102 n_tokens_second=8.337959147622348

Pretty nice.

I've been using llama-sweep-bench and found GGML_SCHED_MAX_COPIES=1 doesn't really help. You get higher t/s but slower PP. Many other params have this tradeoff. Better results with AMB of 512 than 1024. -ub 1024 worked best.

Will try to do just complete FFN layers. Way to make me re-test again.
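(The tables below are llama-sweep-bench output; a run roughly like the following produces them, with the -ot overrides swapped per test, so treat the exact flags as a sketch.)

./build/bin/llama-sweep-bench \
    --model /mnt/models/ubergarm/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
    -c 32768 -ub 1024 \
    -fa -fmoe -rtr -amb 512 \
    -ctk q8_0 -ctv q8_0 \
    -ngl 94 --threads 28 \
    -ot "..."   # tensor overrides under test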

iq3 with MAX_COPIES=1

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 95, n_threads = 28, n_threads_batch = 28

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 11.519 88.90 17.821 14.37
1024 256 1024 10.921 93.76 17.361 14.75
1024 256 2048 11.279 90.79 18.501 13.84
1024 256 3072 11.268 90.87 18.989 13.48
1024 256 4096 11.124 92.05 19.846 12.90
1024 256 5120 11.005 93.05 20.465 12.51
1024 256 6144 11.266 90.89 21.227 12.06
1024 256 7168 11.235 91.14 22.048 11.61
1024 256 8192 11.551 88.65 22.873 11.19
1024 256 9216 11.534 88.78 23.690 10.81
1024 256 10240 11.545 88.70 24.413 10.49
1024 256 11264 11.522 88.87 25.077 10.21
1024 256 12288 11.607 88.23 26.143 9.79
1024 256 13312 11.703 87.50 26.454 9.68
1024 256 14336 11.880 86.20 27.781 9.22
1024 256 15360 11.804 86.75 27.923 9.17
1024 256 16384 11.926 85.86 29.411 8.70

vs

-ot "\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU" 


main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.406 108.87 19.733 12.97
1024 256 1024 9.044 113.23 19.282 13.28
1024 256 2048 9.103 112.49 19.568 13.08
1024 256 3072 9.230 110.94 20.020 12.79
1024 256 4096 9.330 109.75 21.192 12.08
1024 256 5120 9.285 110.28 21.932 11.67
1024 256 6144 9.331 109.75 22.542 11.36
1024 256 7168 9.635 106.28 23.735 10.79
1024 256 8192 9.540 107.33 24.221 10.57
1024 256 9216 9.896 103.48 25.540 10.02
1024 256 10240 9.931 103.12 25.744 9.94
1024 256 11264 9.852 103.94 27.056 9.46
1024 256 12288 9.959 102.82 27.363 9.36
1024 256 13312 9.900 103.43 28.057 9.12
1024 256 14336 10.082 101.57 28.988 8.83
1024 256 15360 10.252 99.88 29.665 8.63
1024 256 16384 10.381 98.64 30.715 8.33
1024 256 17408 10.377 98.68 31.747 8.06
1024 256 18432 10.496 97.56 32.407 7.90
1024 256 19456 10.405 98.42 33.066 7.74
1024 256 20480 10.678 95.90 34.071 7.51
1024 256 21504 10.622 96.40 34.884 7.34
1024 256 22528 10.793 94.88 35.753 7.16
1024 256 23552 10.855 94.34 36.423 7.03
1024 256 24576 11.138 91.94 37.135 6.89
1024 256 25600 11.020 92.92 37.695 6.79
1024 256 26624 11.241 91.09 38.460 6.66
1024 256 27648 11.156 91.79 39.634 6.46
1024 256 28672 11.297 90.64 40.637 6.30
1024 256 29696 11.609 88.21 41.458 6.17
1024 256 30720 11.420 89.66 41.816 6.12
1024 256 31744 11.560 88.58 42.828 5.98
-ot "(1[0-9]|39).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU"

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.455 108.30 20.046 12.77
1024 256 1024 9.044 113.23 19.252 13.30
1024 256 2048 9.134 112.11 19.727 12.98
1024 256 3072 9.173 111.63 20.501 12.49
1024 256 4096 9.157 111.82 21.064 12.15
1024 256 5120 9.322 109.85 22.093 11.59
1024 256 6144 9.289 110.24 22.626 11.31
1024 256 7168 9.510 107.67 23.796 10.76
1024 256 8192 9.641 106.21 24.726 10.35
1024 256 9216 9.674 105.85 25.821 9.91
1024 256 10240 9.857 103.88 26.529 9.65
1024 256 11264 9.906 103.37 27.412 9.34
1024 256 12288 10.087 101.52 28.002 9.14
1024 256 13312 9.963 102.78 28.809 8.89
1024 256 14336 10.214 100.25 29.980 8.54
1024 256 15360 10.263 99.78 30.997 8.26
1024 256 16384 10.286 99.56 31.577 8.11
1024 256 17408 10.511 97.42 32.338 7.92
1024 256 18432 10.451 97.98 32.650 7.84
1024 256 19456 10.491 97.61 33.754 7.58
1024 256 20480 10.703 95.67 33.956 7.54
1024 256 21504 10.707 95.64 34.782 7.36
1024 256 22528 10.773 95.05 35.988 7.11
1024 256 23552 10.946 93.55 36.824 6.95
1024 256 24576 11.020 92.92 37.100 6.90
1024 256 25600 10.987 93.20 38.272 6.69
1024 256 26624 11.166 91.71 39.116 6.54
1024 256 27648 11.420 89.67 40.111 6.38
1024 256 28672 11.370 90.06 41.202 6.21
1024 256 29696 11.510 88.97 41.707 6.14
1024 256 30720 11.573 88.48 42.415 6.04
1024 256 31744 11.530 88.82 42.722 5.99

Also.. for some reason I have to set -ngl lower.. like 93/94 instead of 95. Otherwise it doesn't fill the GPUs but tries to allocate massive buffers while taking far fewer layers. It asked for 9GB+ even, and it's not KV or the compute buffer.

I got it to load at 95 and perf is much worse

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 21.511 47.60 31.077 8.24
1024 256 1024 21.014 48.73 30.145 8.49

vs

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 10.069 101.70 20.471 12.51
1024 256 1024 9.933 103.09 19.225 13.32

Offloading sequential layers makes things more consistent but not necessarily faster.

@Lockout the reason GGML_SCHED_MAX_COPIES matters is that ik_llama.cpp will try to duplicate the VRAM assignment if there is more than one GPU involved:

# ref: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L20201-L20216

// enabling pipeline parallelism in the scheduler increases memory usage, so it is only done when necessary
bool pipeline_parallel =
    llama_get_device_count(*model) > 1 &&
    model->n_gpu_layers > (int)model->hparams.n_layer &&
    model->split_mode == LLAMA_SPLIT_MODE_LAYER &&
    params.offload_kqv;
#ifndef GGML_USE_CUDA
// pipeline parallelism requires support for async compute and events
// currently this is only implemented in the CUDA backend
pipeline_parallel = false;
#endif
ctx->sched = ggml_backend_sched_new(ctx->backends.data(), backend_buft.data(), ctx->backends.size(), max_nodes, pipeline_parallel);

if (pipeline_parallel) {
    LLAMA_LOG_INFO("%s: pipeline parallelism enabled (n_copies=%d)\n", __func__, ggml_backend_sched_get_n_copies(ctx->sched));
}

and at least for me, with the 235B-A22B, I'm already loading up the entirety of my VRAM so I don't have 3x as much VRAM to spare for the parallelism. It's less a speed thing and more of a "I want the model to load at all, please" thing.

@AesSedai is correct. I was trying to use the default, but then it tries to copy some buffers and I get OOM. Especially a big 10GB buffer on the A6000.

llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors:        CPU buffer size = 77572.64 MiB
llm_load_tensors:  CUDA_Host buffer size =   486.86 MiB
llm_load_tensors:      CUDA0 buffer size = 18032.50 MiB
llm_load_tensors:      CUDA1 buffer size = 18032.50 MiB
llm_load_tensors:      CUDA2 buffer size = 25879.55 MiB
llm_load_tensors:      CUDA3 buffer size = 44064.14 MiB
...
llama_kv_cache_init:      CUDA0 KV buffer size =  1152.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =  1152.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =  1472.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =  2240.00 MiB
llama_new_context_with_model: KV self size  = 6016.00 MiB, K (f16): 3008.00 MiB, V (f16): 3008.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 2386.63 MiB
ggml_gallocr_reserve_n: reallocating CUDA1 buffer from size 0.00 MiB to 1056.51 MiB
ggml_gallocr_reserve_n: reallocating CUDA2 buffer from size 0.00 MiB to 2732.50 MiB
ggml_gallocr_reserve_n: reallocating CUDA3 buffer from size 0.00 MiB to 10432.71 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 10432.71 MiB on device 3: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA3 buffer of size 10939492352
llama_new_context_with_model: failed to allocate compute buffers

So that's the mystery. Turning off the copies must be what -ngl 94 is accomplishing: per the snippet above, pipeline parallelism only kicks in when n_gpu_layers exceeds the model's n_layer, so leaving one layer off the GPUs keeps n_copies at 1.

llm_load_tensors: offloaded 94/95 layers to GPU
llm_load_tensors: CPU buffer size = 19656.00 MiB
llm_load_tensors: CUDA_Host buffer size = 1261.20 MiB
llm_load_tensors: CUDA0 buffer size = 22148.27 MiB
llm_load_tensors: CUDA1 buffer size = 22089.93 MiB
llm_load_tensors: CUDA2 buffer size = 22148.27 MiB
llm_load_tensors: CUDA3 buffer size = 22089.93 MiB
....................................................................................................
============ Repacked 55 tensors
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 1024
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 782.01 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 782.01 MiB
llama_new_context_with_model: KV self size = 3196.00 MiB, K (q8_0): 1598.00 MiB, V (q8_0): 1598.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 416.00 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 273.50 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 273.50 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 273.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 609.50 MiB

btw.. holy crap.. pull the new FA fixes.

This is such a beautiful discussion lmao, <3 y'all! I'll send folks over here as they embark on their multi-GPU tensor offload journey! haha

Speaking of discussions.. has anyone tried https://huggingface.co/MikeRoz/Qwen3-235B-A22B-exl2 and how does it compare?

new FA IQ3_K

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.722 105.33 20.006 12.80
1024 256 1024 9.149 111.92 19.087 13.41
1024 256 2048 9.280 110.34 18.442 13.88
1024 256 3072 9.148 111.94 18.475 13.86
----snip
1024 256 28672 10.278 99.63 24.305 10.53
1024 256 29696 10.497 97.55 24.513 10.44
1024 256 30720 10.362 98.83 24.780 10.33
1024 256 31744 10.314 99.28 25.245 10.14

new FA IQ4_XS

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 28, n_threads_batch = 28

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 10.544 97.12 22.129 11.57
1024 256 1024 9.993 102.48 19.998 12.80
1024 256 2048 10.041 101.98 19.028 13.45
1024 256 3072 9.788 104.62 18.866 13.57
----snip
1024 256 28672 11.045 92.71 25.269 10.13
1024 256 29696 11.060 92.58 25.101 10.20
1024 256 30720 11.059 92.60 25.289 10.12
1024 256 31744 11.196 91.46 26.192 9.77

How to test KLD/perplexity painlessly? Q6_K is probably too big of a speed drop, but these two are almost identical. Mainline llama.cpp doesn't even have sweep-bench to compare speeds.

@Lockout @ubergarm has most of the sweep-bench implementation for llama.cpp on this branch: https://github.com/ggml-org/llama.cpp/compare/master...ubergarm:llama.cpp:ug/port-sweep-bench

You can pull / cherry-pick those and recompile llama.cpp to get sweep-bench there too.
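One way to grab it (branch name taken from the compare URL above; cherry-picking the commits onto your own checkout works too):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git remote add ubergarm https://github.com/ubergarm/llama.cpp
git fetch ubergarm
git checkout -b sweep-bench ubergarm/ug/port-sweep-bench
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j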

@Lockout

How to test KLD/perplexity painlessly?

I have slowly collected some PPL and KLD numbers here: https://github.com/ikawrakow/ik_llama.cpp/discussions/359#discussioncomment-13009539

Perplexity is as easy as inferencing with the model. But KLD is tricky, as it makes a big file and you ideally want to get the baseline using the full BF16, which may not be easy as it is over 400GB.

Example perplexity run for full offload:

./build/bin/llama-perplexity \
    -m "$model" \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

Example KLD first pass to generate the KLD base data file from BF16 (or Q8_0 if that is the biggest you can fit). Using a smaller fully-offloaded example here; adjust with your exact arguments for a given bigger model etc:

model=/mnt/raid/models/Qwen/Qwen3-30B-A3B/Qwen3-30B-A3B-BF16-00001-of-00002.gguf
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-BF16-ubergarm-kld-test-corpus-base.dat \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

Example KLD second pass using data file from above to test KLD of smaller models vs baseline.

model=/mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-BF16-ubergarm-kld-test-corpus-base.dat \
    --kl-divergence \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

The KLD run will also give you PPL for another data point with a different corpus.

Just made a reddit post with some metrics if you guys are interested: https://www.reddit.com/r/LocalLLaMA/comments/1kezq68/speed_metrics_running_deepseekv3_0324qwen3_235b/

I posted the offloading speeds there!

Ok.. got perplexity working...

235b ubergarm iq3 : Final estimate: PPL = 3.8092 +/- 0.03584

235b IQ4_XS : Final estimate: PPL = 3.7938 +/- 0.03551

Using calibration_data_v5_rc.txt

If someone has a base/dataset for the 235b I can compare kld.

@Panchovix nice looks like a lot of folks are interested in multi-gpu with big models like you've been testing, thanks for sharing and spreading the word with your updated commands!

@Lockout oh glad you got it running, I've been running a bunch of ppl/kld myself lately and hoping to release some data soon (maybe later today) if I can make some graphs.

I've been using wiki.test.raw (from wiki.test.raw.gz, make sure to gunzip it) for the perplexity test and my own ubergarm-kld-test-corpus.txt (hopefully novel [never been trained on] data I got using whisper transcripts from a podcast; I've described it elsewhere better).

I think it's best to run the tests against a dataset different than whatever people are using for imatrix calibration.

And right, for the baseline 235B first pass I had to run the Q8_0 as I didn't have enough RAM+VRAM. It's a big file.

I have some limited numbers from earlier and there is good discussion here on the challenges given PPL is lower than BF16 for the 30B for some quants: https://github.com/ikawrakow/ik_llama.cpp/discussions/359

It's someone else's calibration dataset, not the one you or unsloth used. I will see what happens on wiki too. I've got 384GB of RAM, but my internet is way too slow to try making my own quants; it takes overnight and into the next day just to get these. DeepSeek will probably take me 2 days in Q2 form.

What a difference only a few tiny layers make?!

A22B-GGUF/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf
-t 28
-c 32768
--host 192.168.1.211
--numa distribute
-ngl 94
-ctk q8_0
-ctv q8_0
-fa
-rtr
-fmoe
-amb 512
-ub 1024
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|).ffn.=CUDA0"
-ot "blk.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33).ffn.
=CUDA1"
-ot "blk.(34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.=CUDA2"
-ot "blk.(51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67).ffn.
=CUDA3"
-ot "ffn.*=CPU"

llm_load_tensors: CPU buffer size = 31876.41 MiB
llm_load_tensors: CUDA_Host buffer size = 820.71 MiB
llm_load_tensors: CUDA0 buffer size = 21717.16 MiB
llm_load_tensors: CUDA1 buffer size = 21680.71 MiB
llm_load_tensors: CUDA2 buffer size = 21717.16 MiB
llm_load_tensors: CUDA3 buffer size = 21680.71 MiB

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.337 109.68 20.979 12.20
1024 256 1024 9.049 113.16 17.932 14.28
1024 256 2048 8.915 114.87 17.710 14.45
1024 256 3072 9.015 113.59 17.950 14.26
1024 256 4096 9.130 112.16 18.154 14.10
1024 256 5120 9.124 112.23 18.203 14.06
1024 256 6144 9.217 111.10 19.760 12.96
1024 256 7168 9.202 111.28 18.715 13.68
1024 256 8192 9.548 107.24 19.221 13.32
1024 256 9216 9.303 110.07 19.298 13.27
1024 256 10240 9.411 108.81 19.781 12.94
1024 256 11264 9.335 109.70 19.705 12.99
1024 256 12288 9.496 107.83 20.257 12.64
1024 256 13312 9.540 107.34 20.536 12.47
1024 256 14336 9.619 106.46 20.685 12.38
1024 256 15360 9.578 106.91 21.045 12.16
1024 256 16384 9.622 106.42 20.749 12.34

A22B-GGUF/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf
-t 28
-c 32768
--host 192.168.1.211
--numa distribute
-ngl 94
-ctk q8_0
-ctv q8_0
-fa
-rtr
-fmoe
-amb 512
-ub 1024
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|).ffn_.exps.=CUDA0"
-ot "blk.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33).ffn
.
exps.=CUDA1"
-ot "blk.(34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn
.exps.=CUDA2"
-ot "blk.(51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67).ffn
.
_exps.=CUDA3"
-ot "ffn.*=CPU"

llm_load_tensors: CPU buffer size = 32013.47 MiB
llm_load_tensors: CUDA_Host buffer size = 820.71 MiB
llm_load_tensors: CUDA0 buffer size = 21682.90 MiB
llm_load_tensors: CUDA1 buffer size = 21646.44 MiB
llm_load_tensors: CUDA2 buffer size = 21682.90 MiB
llm_load_tensors: CUDA3 buffer size = 21646.44 MiB

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 9.878 103.67 23.039 11.11
1024 256 1024 9.441 108.46 21.745 11.77
1024 256 2048 9.364 109.35 20.607 12.42
1024 256 3072 9.379 109.18 20.445 12.52
1024 256 4096 9.486 107.95 20.648 12.40
1024 256 5120 9.407 108.86 20.830 12.29
1024 256 6144 9.543 107.30 21.139 12.11
1024 256 7168 9.497 107.82 20.938 12.23
1024 256 8192 9.578 106.91 21.761 11.76
1024 256 9216 9.574 106.96 21.873 11.70
1024 256 10240 9.668 105.91 21.942 11.67
1024 256 11264 9.780 104.70 22.522 11.37
1024 256 12288 9.762 104.90 22.656 11.30
1024 256 13312 9.809 104.39 23.003 11.13
1024 256 14336 9.890 103.54 22.788 11.23
1024 256 15360 9.953 102.89 23.373 10.95
1024 256 16384 9.883 103.61 23.347 10.96

Yet on IQ3 the reverse is true and it's much closer.
