Speed Benchmarks.

#4 opened by Lockout

So far I only have that Q3_K_XL.

llm_load_print_meta: model size = 147.736 GiB (3.541 BPW)
llm_load_print_meta: repeating layers = 146.737 GiB (3.533 BPW, 356.786 B parameters)

4x 3090s and a dual-Xeon Skylake box with DDR4-2666.

CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
-m GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf \
-t 48 \
-c 32768 \
--numa distribute \
-ngl 94 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-rtr \
-fmoe \
-ub 1024 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)\.ffn_.*_exps.=CUDA0" \
-ot "blk\.(15|16|17|18|19|20|21|22|23|24|25|26)\.ffn_.*_exps.=CUDA1" \
-ot "blk\.(27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*_exps.=CUDA2" \
-ot "blk\.(39|40|41|42|43|44|45|46|47|48|49)\.ffn_.*_exps.=CUDA3" \
-ot "blk\.(50)\.ffn_(up|down)_exps\.weight=CUDA3" \
-ot "\.ffn_.*_exps.=CPU"

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 94, n_threads = 48, n_threads_batch = 48

| PP   | TG  | N_KV  | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 1024 | 256 | 0     | 8.909  | 114.94   | 19.304 | 13.26    |
| 1024 | 256 | 1024  | 8.926  | 114.72   | 20.289 | 12.62    |
| 1024 | 256 | 2048  | 9.014  | 113.60   | 21.401 | 11.96    |
| 1024 | 256 | 3072  | 9.124  | 112.23   | 22.406 | 11.43    |
| 1024 | 256 | 4096  | 9.169  | 111.68   | 24.579 | 10.42    |
| 1024 | 256 | 5120  | 9.219  | 111.08   | 25.270 | 10.13    |
| 1024 | 256 | 6144  | 9.346  | 109.57   | 26.882 | 9.52     |
| 1024 | 256 | 7168  | 9.405  | 108.88   | 29.608 | 8.65     |
| 1024 | 256 | 8192  | 9.510  | 107.68   | 30.990 | 8.26     |
| 1024 | 256 | 9216  | 9.580  | 106.89   | 32.581 | 7.86     |
| 1024 | 256 | 10240 | 9.674  | 105.85   | 34.193 | 7.49     |
| 1024 | 256 | 11264 | 9.778  | 104.73   | 35.900 | 7.13     |
| 1024 | 256 | 12288 | 9.827  | 104.21   | 37.272 | 6.87     |
| 1024 | 256 | 13312 | 9.911  | 103.32   | 38.888 | 6.58     |
| 1024 | 256 | 14336 | 9.985  | 102.56   | 40.304 | 6.35     |
| 1024 | 256 | 15360 | 10.039 | 102.00   | 41.832 | 6.12     |
| 1024 | 256 | 16384 | 10.121 | 101.17   | 43.216 | 5.92     |

It's going to be interesting to see how IQ4_KSS fares. I should also try running perplexity on this one to compare. What's the standard procedure for that so the results can be compared?

@Lockout

Thanks for the report! This larger GLM model is quite usable even with older DDR RAM, and you're even running across multiple NUMA nodes, which is pretty amazing.

Regarding PP speed, you'll likely get more tok/sec by omitting -rtr and increasing to -ub 4096 -b 4096, but you may need to offload less to free up the VRAM needed for the larger batches.

When using -rtr I tend to keep the default batch sizes and offload as much as possible, trying to get a little more TG.
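
For example, a rough sketch of what that big-batch variant could look like, just your command above with -rtr dropped and -b 4096 -ub 4096 added (you may need to trim a few expert layers off the CUDAn lists to make VRAM room for the bigger batch buffers):

CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
    -m GLM-4.5-UD-Q3_K_XL-00001-of-00004.gguf \
    -t 48 -c 32768 --numa distribute -ngl 94 \
    -ctk q8_0 -ctv q8_0 -fa -fmoe \
    -b 4096 -ub 4096 \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14)\.ffn_.*_exps.=CUDA0" \
    -ot "blk\.(15|16|17|18|19|20|21|22|23|24|25|26)\.ffn_.*_exps.=CUDA1" \
    -ot "blk\.(27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*_exps.=CUDA2" \
    -ot "blk\.(39|40|41|42|43|44|45|46|47|48|49)\.ffn_.*_exps.=CUDA3" \
    -ot "blk\.(50)\.ffn_(up|down)_exps\.weight=CUDA3" \
    -ot "\.ffn_.*_exps.=CPU"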

I have a quant cooker's guide in the ik_llama.cpp discussions with some more info: https://github.com/ikawrakow/ik_llama.cpp/discussions/434

A quick reference for my usual perplexity procedure (I try to keep it almost exactly the same, especially the context size, as results would otherwise vary a lot and not be useful for comparison):

$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea  wiki.test.raw

model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf

./build/bin/llama-perplexity \
    -m "$model" \
    --ctx-size 512 \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

The important things to hold consistent are:

  • ctx-size
  • exact wiki.test.raw file
  • -fa should give results very similar to non-fa, but I report using the fa path as it is typically what people use
  • leave the kv-cache unquantized at full f16, as lower precision very slightly affects perplexity

It is fine to change:

  • the seed is not actually used, but I leave it as a kind of calling card lol
  • adjust threads and offload to match whatever is fastest for you
  • it is okay to adjust batch and ubatch (but I avoid going over 4096)

Thanks!

Unfortunately I found out something about 4096. Yes, it gives you "faster" prompt processing (roughly double), but it increases your time to first reply. In actual chats the PP can be half as fast on smaller prompts, regardless of what the bench says. If you're doing big prompts, use 4096; if you're chatting, use -rtr and a 512 or 1024 ubatch.

I'm testing the IQ4_KSS now to see where it falls. Unfortunately gguf scripts don't dump the tensors for me to see file sizes. Likely because I have the mainline gguf package installed. I wish sweep bench printed them in MiB like main, it's a nice QOL thing.

Preliminarily, it looks like this:

| PP   | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|------|--------|----------|--------|----------|
| 1024 | 256 | 0    | 19.871 | 51.53    | 25.478 | 10.05    |
| 1024 | 256 | 1024 | 19.530 | 52.43    | 26.619 | 9.62     |
| 1024 | 256 | 2048 | 19.901 | 51.46    | 28.151 | 9.09     |
| 1024 | 256 | 3072 | 20.151 | 50.82    | 29.846 | 8.58     |
| 1024 | 256 | 4096 | 19.966 | 51.29    | 31.707 | 8.07     |

I probably have to drop caches since it's loading slowly.
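
For anyone following along, dropping the Linux page cache before a reload is typically done with (needs root/sudo):

$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches   # frees page cache, dentries, and inodes so the next model load reads fresh from disk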

@Lockout

Yup, increasing the batch size can increase "latency" for short prompts, pretty sure. And right, for lots of long prompts it is beneficial, but for everyday shorter-prompt multi-turn chats, -rtr and default or smaller batches are fine and might allow one more routed expert layer offloaded to VRAM too.

"Unfortunately gguf scripts don't dump the tensors for me to see file sizes. Likely because I have the mainline gguf package installed."

You can view the tensors using the same gguf package from ik_llama.cpp. Unfortunately, Hugging Face doesn't support this, so most of my quants are not visible in the sidebar, which is annoying and inconvenient indeed. Here is the gguf dump script:

$ cd ik_llama.cpp
$ source venv/bin/activate # or however you manage python packages
$ pip install numpy==1.26.4 # a few more packages you'll need too
$ python gguf-py/scripts/gguf_dump.py /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-main-IQ3_KS.gguf

"I wish sweep bench printed them in MiB like main, it's a nice QOL thing."

I'm not sure exactly where llama-sweep-bench prints tensor sizes in MiB? sweep-bench isn't even officially on mainline, pretty sure. If you have a short clip of the logs and/or the command, I might understand better. But yeah, in general I too like my numbers in GiB and MiB!

llama.cpp prints it when loading; I mis-wrote. I assume the loading sequence is all the same though: it says it offloaded tensor xyz to CPU (750 MB) and so on. In mainline I think you may have to enable verbose output now to see it.

For that script, on mainline it calls the gguf_reader, which is Python. If you have the gguf-py package installed it will use its reader. I also had to patch it to display MB instead of bytes; on a large list that's quite distracting.
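
If anyone wants the same output without patching gguf_dump.py, here is a rough standalone sketch along those lines using the gguf-py reader (the script name and exact formatting are just illustrative, not the actual patch):

# list_tensor_mib.py -- print each tensor's quant type and size in MiB via the gguf-py reader (illustrative sketch)
import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])
total_bytes = 0
for t in reader.tensors:
    # each ReaderTensor carries its name, quant type, and on-disk byte size
    mib = t.n_bytes / (1024 * 1024)
    total_bytes += t.n_bytes
    print(f"{t.name:48s} {t.tensor_type.name:>8s} {mib:10.1f} MiB")
print(f"total: {total_bytes / 1024**3:.2f} GiB")

Run it like the dump script, e.g. python list_tensor_mib.py /path/to/model.gguf.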
