Speed Benchmarks.
So far I only have that Q3_K_XL:
llm_load_print_meta: model size = 147.736 GiB (3.541 BPW)
llm_load_print_meta: repeating layers = 146.737 GiB (3.533 BPW, 356.786 B parameters)
4x 3090s and dual Xeon (Skylake) with DDR4-2666.
CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
-m GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf \
-t 48 \
-c 32768 \
--numa distribute \
-ngl 94 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-rtr \
-fmoe \
-ub 1024 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)\.ffn_.*_exps.=CUDA0" \
-ot "blk\.(15|16|17|18|19|20|21|22|23|24|25|26)\.ffn_.*_exps.=CUDA1" \
-ot "blk\.(27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*_exps.=CUDA2" \
-ot "blk\.(39|40|41|42|43|44|45|46|47|48|49)\.ffn_.*_exps.=CUDA3" \
-ot "blk\.(50)\.ffn_(up|down)_exps\.weight=CUDA3" \
-ot "\.ffn_.*_exps.=CPU"
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 48, n_threads_batch = 48
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 8.909 | 114.94 | 19.304 | 13.26 |
1024 | 256 | 1024 | 8.926 | 114.72 | 20.289 | 12.62 |
1024 | 256 | 2048 | 9.014 | 113.60 | 21.401 | 11.96 |
1024 | 256 | 3072 | 9.124 | 112.23 | 22.406 | 11.43 |
1024 | 256 | 4096 | 9.169 | 111.68 | 24.579 | 10.42 |
1024 | 256 | 5120 | 9.219 | 111.08 | 25.270 | 10.13 |
1024 | 256 | 6144 | 9.346 | 109.57 | 26.882 | 9.52 |
1024 | 256 | 7168 | 9.405 | 108.88 | 29.608 | 8.65 |
1024 | 256 | 8192 | 9.510 | 107.68 | 30.990 | 8.26 |
1024 | 256 | 9216 | 9.580 | 106.89 | 32.581 | 7.86 |
1024 | 256 | 10240 | 9.674 | 105.85 | 34.193 | 7.49 |
1024 | 256 | 11264 | 9.778 | 104.73 | 35.900 | 7.13 |
1024 | 256 | 12288 | 9.827 | 104.21 | 37.272 | 6.87 |
1024 | 256 | 13312 | 9.911 | 103.32 | 38.888 | 6.58 |
1024 | 256 | 14336 | 9.985 | 102.56 | 40.304 | 6.35 |
1024 | 256 | 15360 | 10.039 | 102.00 | 41.832 | 6.12 |
1024 | 256 | 16384 | 10.121 | 101.17 | 43.216 | 5.92 |
It's going to be interesting to see how IQ4_KSS fares. I should run perplexity on this one to compare, too. What's the standard procedure for that so the numbers can be compared?
Thanks for the report! This larger GLM model is quite usable even with older DDR RAM, and you're running across multiple NUMA nodes too, which is pretty amazing.
Regarding PP speed, you'll likely get more tok/sec by omitting -rtr and increasing to -ub 4096 -b 4096, but you may need to offload less to free up the VRAM needed for the larger batches (rough sketch below).
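Something like this, reusing your command from above (a sketch only; the trimmed per-GPU layer split below is just a guess on my part, so adjust it until the larger compute buffers actually fit in your VRAM):
CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
-m GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf \
-t 48 \
-c 32768 \
--numa distribute \
-ngl 94 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-fmoe \
-b 4096 \
-ub 4096 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*_exps.=CUDA0" \
-ot "blk\.(14|15|16|17|18|19|20|21|22|23|24)\.ffn_.*_exps.=CUDA1" \
-ot "blk\.(25|26|27|28|29|30|31|32|33|34|35)\.ffn_.*_exps.=CUDA2" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45)\.ffn_.*_exps.=CUDA3" \
-ot "\.ffn_.*_exps.=CPU"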
When using -rtr, I tend to keep the default batch sizes and offload as much as possible, trying to get a little more TG.
I have a quant cooker's guide in the ik_llama.cpp discussions with some more info: https://github.com/ikawrakow/ik_llama.cpp/discussions/434
A quick reference for my usual perplexity procedure (I try to keep it almost exactly the same, especially the context size, as results would otherwise vary a lot and not be useful for comparison):
$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea wiki.test.raw
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf
./build/bin/llama-perplexity \
-m "$model" \
--ctx-size 512 \
-f wiki.test.raw \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
The important things to hold consistent are:
- ctx-size
- the exact wiki.test.raw file
- fa (should be very similar to non-fa, but I report using the fa path as it is typically what people use)
- leave the kv-cache unquantized at full f16, as lower precisions very slightly affect perplexity
It is fine to change:
- the seed: it is not actually used, but I leave it as a kind of calling card lol
- adjust threads and offload to match whatever is fastest for you
- it is okay to adjust batch and ubatch (but I avoid going over 4096)
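For the big GLM quant above, the same recipe applies. A sketch might look like the following, borrowing the tensor overrides verbatim from your sweep-bench command (adjust offload and threads to whatever is fastest on your box; note the KV cache is left at the default f16 per the list above):
model=GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf
./build/bin/llama-perplexity \
-m "$model" \
--ctx-size 512 \
-f wiki.test.raw \
-fa \
-ngl 94 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)\.ffn_.*_exps.=CUDA0" \
-ot "blk\.(15|16|17|18|19|20|21|22|23|24|25|26)\.ffn_.*_exps.=CUDA1" \
-ot "blk\.(27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*_exps.=CUDA2" \
-ot "blk\.(39|40|41|42|43|44|45|46|47|48|49)\.ffn_.*_exps.=CUDA3" \
-ot "blk\.(50)\.ffn_(up|down)_exps\.weight=CUDA3" \
-ot "\.ffn_.*_exps.=CPU" \
--seed 1337 \
--threads 48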
Thanks!
Unfortunately I found out something about 4096. Yes, it gives you "faster" prompt processing (roughly double), but it increases your time to reply. In actual chats the PP can be half as fast on smaller prompts, regardless of what the bench says. If you're doing big prompts, use 4096; if you're chatting, use -rtr and a 512 or 1024 ubatch.
I'm testing the IQ4_KSS now to see where it falls. Unfortunately the gguf scripts don't dump the tensors for me to see file sizes, likely because I have the mainline gguf package installed. I wish sweep-bench printed them in MiB like main; it's a nice QOL thing.
Preliminary results look like this:
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 19.871 | 51.53 | 25.478 | 10.05 |
1024 | 256 | 1024 | 19.530 | 52.43 | 26.619 | 9.62 |
1024 | 256 | 2048 | 19.901 | 51.46 | 28.151 | 9.09 |
1024 | 256 | 3072 | 20.151 | 50.82 | 29.846 | 8.58 |
1024 | 256 | 4096 | 19.966 | 51.29 | 31.707 | 8.07 |
Probably have to drop caches, since it's loading slowly.
Yup, increasing the batch size can increase "latency" for short prompts, pretty sure. And right, for lots of long prompts it is beneficial, but for everyday shorter-prompt multi-turn chats, -rtr and default or smaller batches are fine and might even allow one more routed expert layer to be offloaded to VRAM.
> Unfortunately the gguf scripts don't dump the tensors for me to see file sizes, likely because I have the mainline gguf package installed.
You can view the tensors using the gguf package that ships with ik_llama.cpp. Unfortunately, Hugging Face doesn't support this, so most of my quants are not visible in the sidebar, which is annoying and inconvenient indeed. Here is the gguf dump script:
$ cd ik_llama.cpp
$ source venv/bin/activate # or however you manage python packages
$ pip install numpy==1.26.4 # a few more packages you'll need too
$ python gguf-py/scripts/gguf_dump.py /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-main-IQ3_KS.gguf
> I wish sweep-bench printed them in MiB like main; it's a nice QOL thing.
I'm not sure exactly where llama-sweep-bench would print tensor sizes in MiB? sweep-bench isn't even officially in mainline, pretty sure. If you have a short clip of the logs and/or the command, I might understand better. But yeah, in general I too like my numbers in GiB and MiB!
llama.cpp prints it when loading; I mis-wrote. I assume the loading sequence is all the same though: it says something like "offloaded tensor xyz to CPU (750 MB)" etc. In mainline I think you may have to enable verbose output now to see it.
For that script, on mainline it calls gguf_reader, which is Python. If you have the gguf-py package installed, it will use that reader. I also had to patch it to display MB rather than raw bytes; on a large list that's quite distracting.
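Roughly, the idea boils down to something like this standalone sketch (not my exact patch; it assumes the gguf-py bundled with ik_llama.cpp exposes GGUFReader with per-tensor n_bytes, and it puts gguf-py on PYTHONPATH so the mainline package doesn't shadow it):
$ cd ik_llama.cpp
$ PYTHONPATH=gguf-py python - /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-main-IQ3_KS.gguf <<'EOF'
# sketch: list per-tensor sizes in MiB rather than raw bytes
import sys
from gguf import GGUFReader  # gguf-py reader shipped with ik_llama.cpp

reader = GGUFReader(sys.argv[1])
for t in reader.tensors:
    # n_bytes is the on-disk size of the (possibly quantized) tensor data
    print(f"{t.name:<48} {t.tensor_type.name:>8} {t.n_bytes / (1024 * 1024):10.1f} MiB")
EOF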