Testing smol-IQ4_KSS
W790E Sage + QYFS + 512G + RTX5090
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 170240
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6061.01 MiB
llama_new_context_with_model: KV self size = 6060.98 MiB, c^KV (q8_0): 6060.98 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11347.85 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2771.88 MiB
llama_new_context_with_model: graph nodes = 24387
llama_new_context_with_model: graph splits = 122
main: n_kv_max = 170240, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
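As a rough sanity check on the `KV self size` line above, the c^KV figure can be reproduced with a little arithmetic. This is only a sketch under assumptions that the log does not state: 61 layers (inferred from `blk.60` being the last block), and a cached row of 512 latent values plus 64 RoPE values per token per layer. The 34-byte size of a 32-value q8_0 block is the standard ggml layout.

```python
# Rough check of the reported c^KV cache size for an MLA model.
# Values marked "assumed" are NOT in the log; they are guesses for illustration.
N_CTX = 170240            # from the log (n_ctx)
N_LAYERS = 61             # assumed: blocks blk.0 .. blk.60
LATENT_DIM = 512          # matches the 512 in the attn_kv_b shape
ROPE_DIM = 64             # assumed: RoPE part cached alongside the latent

# q8_0 stores 32 int8 values plus one fp16 scale per block: 34 bytes.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 34


def mla_kv_cache_mib(n_ctx: int, n_layers: int, row_dim: int) -> float:
    """Size in MiB of a q8_0 cache holding one row of row_dim values per token per layer."""
    bytes_per_row = row_dim // Q8_0_BLOCK_VALUES * Q8_0_BLOCK_BYTES
    return n_ctx * n_layers * bytes_per_row / (1024 ** 2)


size_mib = mla_kv_cache_mib(N_CTX, N_LAYERS, LATENT_DIM + ROPE_DIM)
print(f"{size_mib:.2f} MiB")
```

If those assumptions hold, the result lines up with the 6060.98 MiB reported in the log, which suggests the cache really is the compressed latent only.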
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 52.482 | 77.93 | 86.375 | 11.83 |
| 4090 | 1022 | 4090 | 52.895 | 77.32 | 82.629 | 12.37 |
| 4090 | 1022 | 8180 | 83.991 | 48.70 | 80.144 | 12.75 |
| 4090 | 1022 | 12270 | 54.705 | 74.77 | 77.761 | 13.14 |
| 4090 | 1022 | 16360 | 54.601 | 74.91 | 96.437 | 10.60 |
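For anyone reading the table: the speed columns appear to be just tokens divided by seconds (S_PP = PP / T_PP, S_TG = TG / T_TG). A minimal sketch that recomputes them from the timings above; the `rows` list and the `throughput` helper are only illustrative names, not part of the benchmark tool.

```python
# Recompute the throughput columns from the timings in the table above.
# Each row: (PP tokens, TG tokens, N_KV, T_PP seconds, T_TG seconds)
rows = [
    (4090, 1022, 0, 52.482, 86.375),
    (4090, 1022, 4090, 52.895, 82.629),
    (4090, 1022, 8180, 83.991, 80.144),
    (4090, 1022, 12270, 54.705, 77.761),
    (4090, 1022, 16360, 54.601, 96.437),
]


def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second for one phase of a benchmark row."""
    return tokens / seconds


for pp, tg, n_kv, t_pp, t_tg in rows:
    s_pp = throughput(pp, t_pp)
    s_tg = throughput(tg, t_tg)
    print(f"N_KV={n_kv:>6}  S_PP={s_pp:6.2f} t/s  S_TG={s_tg:5.2f} t/s")
```

Recomputing this way makes the outliers easier to spot, for example the slower prompt-processing pass at N_KV = 8180 and the slower generation pass at N_KV = 16360.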
Why do you use `-t 101` on the 56-core QYFS CPU? Have you tried something like `--threads 48 --threads-batch 56`, which I assume would do better? Unless we already had this discussion in another thread, haha... Generally SMT/Hyperthreading doesn't help, or actually hurts speed and generates more heat. Also, using a power of 2 feels nicer and might have some benefit, but maybe I'm just superstitious lol.
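If it helps when picking a thread count, here is a small sketch for counting physical cores (as opposed to SMT threads) on Linux. It assumes the usual sysfs topology files are present; the function names are made up for illustration. Each logical CPU reports the set of logical CPUs that share its core, so the number of distinct sets is the number of physical cores.

```python
import glob


def count_physical_cores(sibling_lists):
    """Count distinct SMT sibling groups; each distinct group is one physical core."""
    return len({s.strip() for s in sibling_lists if s.strip()})


def read_sibling_lists():
    """Linux-only: read every logical CPU's thread_siblings_list from sysfs."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


if __name__ == "__main__":
    # Prints 0 on systems without these sysfs files (e.g. non-Linux).
    print(count_physical_cores(read_sibling_lists()))
```

On a 56-core part with SMT enabled this should report 56 even though the OS shows 112 logical CPUs, which gives a starting point for `--threads` before benchmarking a few values around it.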
Why do you not use `-ctv q8_0`? In terms of performance it's not any better, so is it because of stability?