Testing smol-IQ4_KSS

#2
by shewin - opened

W790E Sage + QYFS + 512G + RTX5090

Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 170240
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6061.01 MiB
llama_new_context_with_model: KV self size = 6060.98 MiB, c^KV (q8_0): 6060.98 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11347.85 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2771.88 MiB
llama_new_context_with_model: graph nodes = 24387
llama_new_context_with_model: graph splits = 122
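
For anyone wondering where the 6060.98 MiB KV figure comes from, here is a rough sketch of the arithmetic. It assumes 61 layers (guessed from blk.60 being the last block in the log), a 512-wide MLA latent (from the 512 x 16384 attn_kv_b shape) plus 64 RoPE dims per token (an assumption typical of DeepSeek-style MLA, not shown in the log), and the q8_0 layout of 34 bytes per 32 values. The function name is made up for illustration.

```python
# q8_0 stores each block of 32 values as 32 int8 values plus one fp16 scale.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 34


def mla_kv_cache_mib(n_ctx, n_layers, latent_dim, rope_dim):
    """Estimate the size of a q8_0 MLA latent cache in MiB.

    Assumes one compressed vector of (latent_dim + rope_dim) values
    per token per layer, which is an assumption about the model.
    """
    values_per_token = latent_dim + rope_dim
    bytes_per_token = values_per_token // Q8_0_BLOCK_VALUES * Q8_0_BLOCK_BYTES
    return n_ctx * n_layers * bytes_per_token / (1024 * 1024)


# n_ctx comes from the log; n_layers and rope_dim are assumptions.
kv_mib = mla_kv_cache_mib(n_ctx=170240, n_layers=61, latent_dim=512, rope_dim=64)
print(f"{kv_mib:.2f} MiB")
```

Under those assumptions the estimate lines up with the "KV self size" line in the log.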

main: n_kv_max = 170240, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

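| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4090 | 1022 | 0 | 52.482 | 77.93 | 86.375 | 11.83 |
| 4090 | 1022 | 4090 | 52.895 | 77.32 | 82.629 | 12.37 |
| 4090 | 1022 | 8180 | 83.991 | 48.70 | 80.144 | 12.75 |
| 4090 | 1022 | 12270 | 54.705 | 74.77 | 77.761 | 13.14 |
| 4090 | 1022 | 16360 | 54.601 | 74.91 | 96.437 | 10.60 |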
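
As a quick sanity check on the numbers above, the speed columns appear to be just tokens divided by seconds (S_PP = PP / T_PP and S_TG = TG / T_TG). A small sketch, with the rows copied from the benchmark output and variable names chosen only for illustration:

```python
# Rows copied from the sweep output: (PP, TG, N_KV, T_PP s, T_TG s).
rows = [
    (4090, 1022, 0, 52.482, 86.375),
    (4090, 1022, 4090, 52.895, 82.629),
    (4090, 1022, 8180, 83.991, 80.144),
    (4090, 1022, 12270, 54.705, 77.761),
    (4090, 1022, 16360, 54.601, 96.437),
]

# Derive tokens per second for prompt processing (PP) and generation (TG).
derived = [
    (n_kv, pp / t_pp, tg / t_tg)
    for pp, tg, n_kv, t_pp, t_tg in rows
]

for n_kv, s_pp, s_tg in derived:
    print(f"N_KV={n_kv:>6}  S_PP={s_pp:6.2f} t/s  S_TG={s_tg:5.2f} t/s")
```

The derived values can differ from the reported ones in the last digit because the times in the table are already rounded.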

2025-09-08_21-00.png

Why do you use -t 101 on the 56-core QYFS CPU?

Have you tried something like --threads 48 --threads-batch 56, for example, which I assume would do better? Unless we already had this discussion on another thread, haha... Generally SMT/Hyperthreading doesn't help, or actually hurts speed and makes more heat. Also, using a power of 2 feels nicer and might have some benefit, but maybe I'm just superstitious lol.

My workstation has many BIOS options that affect performance.

In my experience, running with half the threads is not significantly better in power draw or speed:
2025-09-09_10-04.png

With SMT turned off in the OS:
2025-09-09_10-30.png

But I like this
2025-09-09_10-43.png

Why don't you use -ctv q8_0?
It's not any better for performance, so is it because of stability?
