Speed Benchmarks.
So far I only have that Q3_K_XL:
llm_load_print_meta: model size = 147.736 GiB (3.541 BPW)
llm_load_print_meta: repeating layers = 146.737 GiB (3.533 BPW, 356.786 B parameters)
4x 3090s and dual Xeon (Skylake) with DDR4-2666.
CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
-m GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf \
-t 48 \
-c 32768 \
--numa distribute \
-ngl 94 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-rtr \
-fmoe \
-ub 1024 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)\.ffn_.*_exps.=CUDA0" \
-ot "blk\.(15|16|17|18|19|20|21|22|23|24|25|26)\.ffn_.*_exps.=CUDA1" \
-ot "blk\.(27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*_exps.=CUDA2" \
-ot "blk\.(39|40|41|42|43|44|45|46|47|48|49)\.ffn_.*_exps.=CUDA3" \
-ot "blk\.(50)\.ffn_(up|down)_exps\.weight=CUDA3" \
-ot "\.ffn_.*_exps.=CPU"
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 48, n_threads_batch = 48
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 8.909 | 114.94 | 19.304 | 13.26 |
1024 | 256 | 1024 | 8.926 | 114.72 | 20.289 | 12.62 |
1024 | 256 | 2048 | 9.014 | 113.60 | 21.401 | 11.96 |
1024 | 256 | 3072 | 9.124 | 112.23 | 22.406 | 11.43 |
1024 | 256 | 4096 | 9.169 | 111.68 | 24.579 | 10.42 |
1024 | 256 | 5120 | 9.219 | 111.08 | 25.270 | 10.13 |
1024 | 256 | 6144 | 9.346 | 109.57 | 26.882 | 9.52 |
1024 | 256 | 7168 | 9.405 | 108.88 | 29.608 | 8.65 |
1024 | 256 | 8192 | 9.510 | 107.68 | 30.990 | 8.26 |
1024 | 256 | 9216 | 9.580 | 106.89 | 32.581 | 7.86 |
1024 | 256 | 10240 | 9.674 | 105.85 | 34.193 | 7.49 |
1024 | 256 | 11264 | 9.778 | 104.73 | 35.900 | 7.13 |
1024 | 256 | 12288 | 9.827 | 104.21 | 37.272 | 6.87 |
1024 | 256 | 13312 | 9.911 | 103.32 | 38.888 | 6.58 |
1024 | 256 | 14336 | 9.985 | 102.56 | 40.304 | 6.35 |
1024 | 256 | 15360 | 10.039 | 102.00 | 41.832 | 6.12 |
1024 | 256 | 16384 | 10.121 | 101.17 | 43.216 | 5.92 |
It's going to be interesting to see how IQ4_KSS fares. I should run perplexity on this one to compare, too. What's the standard procedure for that so the numbers can be compared?
Thanks for the report! This larger GLM model is quite usable even with older DDR RAM, and you're running across multiple NUMA nodes too, which is pretty amazing.
Regarding PP speed, you'll likely get more tok/sec by omitting -rtr and increasing to -ub 4096 -b 4096, but you may need to offload less to free up the VRAM needed for the larger batches (rough sketch below).
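Something like this, reusing your command from above (a sketch only; the trimmed per-GPU layer split below is just a guess on my part, so adjust it until the larger compute buffers actually fit in your VRAM):
CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
-m GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf \
-t 48 \
-c 32768 \
--numa distribute \
-ngl 94 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-fmoe \
-b 4096 \
-ub 4096 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*_exps.=CUDA0" \
-ot "blk\.(14|15|16|17|18|19|20|21|22|23|24)\.ffn_.*_exps.=CUDA1" \
-ot "blk\.(25|26|27|28|29|30|31|32|33|34|35)\.ffn_.*_exps.=CUDA2" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45)\.ffn_.*_exps.=CUDA3" \
-ot "\.ffn_.*_exps.=CPU"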
When using -rtr, I tend to keep the default batch sizes and offload as much as possible, trying to get a little more TG.
I have a quant cooker's guide in the ik_llama.cpp discussions with some more info: https://github.com/ikawrakow/ik_llama.cpp/discussions/434
A quick reference for my usual perplexity procedure (I try to keep it almost exactly the same, especially the context size, as results would otherwise vary a lot and not be useful for comparison):
$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea wiki.test.raw
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf
./build/bin/llama-perplexity \
-m "$model" \
--ctx-size 512 \
-f wiki.test.raw \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
The important things to hold consistent are:
- ctx-size
- the exact wiki.test.raw file
- fa (should be very similar to non-fa, but I report using the fa path as it is typically what people use)
- leave the kv-cache unquantized at full f16, as lower precisions very slightly affect perplexity
It is fine to change:
- the seed: it is not actually used, but I leave it as a kind of calling card lol
- adjust threads and offload to match whatever is fastest for you
- it is okay to adjust batch and ubatch (but I avoid going over 4096)
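For the big GLM quant above, the same recipe applies. A sketch might look like the following, borrowing the tensor overrides verbatim from your sweep-bench command (adjust offload and threads to whatever is fastest on your box; note the KV cache is left at the default f16 per the list above):
model=GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf
./build/bin/llama-perplexity \
-m "$model" \
--ctx-size 512 \
-f wiki.test.raw \
-fa \
-ngl 94 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)\.ffn_.*_exps.=CUDA0" \
-ot "blk\.(15|16|17|18|19|20|21|22|23|24|25|26)\.ffn_.*_exps.=CUDA1" \
-ot "blk\.(27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*_exps.=CUDA2" \
-ot "blk\.(39|40|41|42|43|44|45|46|47|48|49)\.ffn_.*_exps.=CUDA3" \
-ot "blk\.(50)\.ffn_(up|down)_exps\.weight=CUDA3" \
-ot "\.ffn_.*_exps.=CPU" \
--seed 1337 \
--threads 48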
Thanks!
Unfortunately I found out something about 4096. Yes, it gives you "faster" prompt processing (roughly double), but it increases your time to reply. In actual chats the PP can be half as fast on smaller prompts, regardless of what the bench says. If you're doing big prompts, use 4096; if you're chatting, use -rtr and a 512 or 1024 ubatch.
I'm testing the IQ4_KSS now to see where it falls. Unfortunately the gguf scripts don't dump the tensors for me to see file sizes, likely because I have the mainline gguf package installed. I wish sweep-bench printed them in MiB like main; it's a nice QOL thing.
Preliminary results look like this:
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 19.871 | 51.53 | 25.478 | 10.05 |
1024 | 256 | 1024 | 19.530 | 52.43 | 26.619 | 9.62 |
1024 | 256 | 2048 | 19.901 | 51.46 | 28.151 | 9.09 |
1024 | 256 | 3072 | 20.151 | 50.82 | 29.846 | 8.58 |
1024 | 256 | 4096 | 19.966 | 51.29 | 31.707 | 8.07 |
Probably have to drop caches, since it's loading slowly.
Yup, increasing the batch size can increase "latency" for short prompts, pretty sure. And right, for lots of long prompts it is beneficial, but for everyday shorter-prompt multi-turn chats, -rtr and default or smaller batches are fine and might even allow one more routed expert layer to be offloaded to VRAM.
> Unfortunately the gguf scripts don't dump the tensors for me to see file sizes, likely because I have the mainline gguf package installed.
You can view the tensors using the gguf package that ships with ik_llama.cpp. Unfortunately, Hugging Face doesn't support this, so most of my quants are not visible in the sidebar, which is annoying and inconvenient indeed. Here is the gguf dump script:
$ cd ik_llama.cpp
$ source venv/bin/activate # or however you manage python packages
$ pip install numpy==1.26.4 # a few more packages you'll need too
$ python gguf-py/scripts/gguf_dump.py /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-main-IQ3_KS.gguf
> I wish sweep-bench printed them in MiB like main; it's a nice QOL thing.
I'm not sure exactly where llama-sweep-bench would print tensor sizes in MiB? sweep-bench isn't even officially in mainline, pretty sure. If you have a short clip of the logs and/or the command, I might understand better. But yeah, in general I too like my numbers in GiB and MiB!
llama.cpp prints it when loading; I mis-wrote. I assume the loading sequence is all the same though: it says something like "offloaded tensor xyz to CPU (750 MB)" etc. In mainline I think you may have to enable verbose output now to see it.
For that script, on mainline it calls gguf_reader, which is Python. If you have the gguf-py package installed, it will use that reader. I also had to patch it to display MB rather than raw bytes; on a large list that's quite distracting.
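Roughly, the idea boils down to something like this standalone sketch (not my exact patch; it assumes the gguf-py bundled with ik_llama.cpp exposes GGUFReader with per-tensor n_bytes, and it puts gguf-py on PYTHONPATH so the mainline package doesn't shadow it):
$ cd ik_llama.cpp
$ PYTHONPATH=gguf-py python - /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-main-IQ3_KS.gguf <<'EOF'
# sketch: list per-tensor sizes in MiB rather than raw bytes
import sys
from gguf import GGUFReader  # gguf-py reader shipped with ik_llama.cpp

reader = GGUFReader(sys.argv[1])
for t in reader.tensors:
    # n_bytes is the on-disk size of the (possibly quantized) tensor data
    print(f"{t.name:<48} {t.tensor_type.name:>8} {t.n_bytes / (1024 * 1024):10.1f} MiB")
EOF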