ubergarm/Kimi-K2-Instruct-0905-GGUF · Testing smol-IQ4_KSS

about 16 hours ago

W790E Sage + QYFS + 512G + RTX5090

Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 170240
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6061.01 MiB
llama_new_context_with_model: KV self size = 6060.98 MiB, c^KV (q8_0): 6060.98 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11347.85 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2771.88 MiB
llama_new_context_with_model: graph nodes = 24387
llama_new_context_with_model: graph splits = 122

main: n_kv_max = 170240, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
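
For anyone wondering where the 6060.98 MiB KV figure in the log comes from, here is a rough back-of-the-envelope check. It is only a sketch under assumptions that are not printed in the log: 61 layers (blk.0 through blk.60), 512 latent values plus 64 RoPE values cached per token per layer (the DeepSeek-style MLA layout), and q8_0 stored as 34 bytes per block of 32 values.

```python
# Rough estimate of the MLA c^KV cache size reported in the log above.
# Assumed, not stated in the log: layer count, the 512 + 64 per-token
# layout, and the q8_0 block size (32 int8 values + one fp16 scale).
N_CTX = 170240           # n_ctx from the log
N_LAYERS = 61            # assumption: blk.0 .. blk.60
KV_LORA_RANK = 512       # assumption: latent KV width
ROPE_DIMS = 64           # assumption: RoPE part cached alongside it
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 34


def mla_kv_cache_mib(n_ctx: int) -> float:
    """Return the estimated q8_0 c^KV cache size in MiB."""
    values_per_token = KV_LORA_RANK + ROPE_DIMS
    blocks_per_token = values_per_token // Q8_0_BLOCK_VALUES
    bytes_per_token_layer = blocks_per_token * Q8_0_BLOCK_BYTES
    total_bytes = n_ctx * N_LAYERS * bytes_per_token_layer
    return total_bytes / 2**20


kv_mib = mla_kv_cache_mib(N_CTX)
print(f"{kv_mib:.2f} MiB")
```

Under those assumptions the estimate lines up with the "KV self size" line in the log, which suggests the cache holds only the compressed latent, consistent with "kv^T: not used".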

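| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 4090 | 1022 | 0 | 52.482 | 77.93 | 86.375 | 11.83 |
| 4090 | 1022 | 4090 | 52.895 | 77.32 | 82.629 | 12.37 |
| 4090 | 1022 | 8180 | 83.991 | 48.70 | 80.144 | 12.75 |
| 4090 | 1022 | 12270 | 54.705 | 74.77 | 77.761 | 13.14 |
| 4090 | 1022 | 16360 | 54.601 | 74.91 | 96.437 | 10.60 |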
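
As a quick sanity check on the table, the speed columns are just tokens divided by seconds (S_PP = PP / T_PP, S_TG = TG / T_TG). A minimal sketch that recomputes them from the timings above; the `ROWS` list and the `throughput` helper are names made up for this example, not part of any benchmark tool.

```python
# Recompute the t/s columns from the timing columns of the table.
# Each row: (PP, TG, N_KV, T_PP seconds, T_TG seconds), copied from above.
ROWS = [
    (4090, 1022, 0, 52.482, 86.375),
    (4090, 1022, 4090, 52.895, 82.629),
    (4090, 1022, 8180, 83.991, 80.144),
    (4090, 1022, 12270, 54.705, 77.761),
    (4090, 1022, 16360, 54.601, 96.437),
]


def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second for one benchmark step."""
    return tokens / seconds


for pp, tg, n_kv, t_pp, t_tg in ROWS:
    s_pp = throughput(pp, t_pp)
    s_tg = throughput(tg, t_tg)
    print(f"N_KV={n_kv:>6}  S_PP={s_pp:6.2f} t/s  S_TG={s_tg:5.2f} t/s")
```

The recomputed values match the reported ones, so the dip to 48.70 t/s at N_KV = 8180 comes from the longer 83.991 s prompt time on that step rather than from a reporting glitch.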

ubergarm

Owner about 14 hours ago

Why do you use -t 101 on the 56-core QYFS CPU?

Have you tried something like --threads 48 --threads-batch 56, which I'd expect to do better? Unless we already had this discussion in another thread, haha... Generally SMT/Hyperthreading doesn't help, or actually hurts speed, and just makes more heat. Using a power of 2 also feels nicer and might have some benefit, but maybe I'm just superstitious lol.
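
If it helps when picking a thread count, here is a small sketch for counting physical cores as opposed to SMT threads. It assumes the x86 Linux /proc/cpuinfo layout, where each logical CPU block has "physical id" and "core id" fields; the function name and the sample text are made up for illustration.

```python
# Count physical cores by collecting unique (physical id, core id) pairs.
# Assumes x86 Linux /proc/cpuinfo formatting; other platforms differ.
def count_physical_cores(cpuinfo_text: str) -> int:
    cores = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "physical id" in fields and "core id" in fields:
            cores.add((fields["physical id"], fields["core id"]))
    return len(cores)


# Synthetic sample: 4 logical CPUs that are SMT siblings on 2 cores.
SAMPLE = """processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""

print(count_physical_cores(SAMPLE))  # 2
```

On a real Linux box you could pass in `open("/proc/cpuinfo").read()` and use the result as a starting point for --threads, then benchmark around it.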

shewin

about 2 hours ago

My workstation has many BIOS options that affect performance.

In my experience, using half the threads is not significantly better in power, speed, or anything else when SMT is turned off in the OS.

But I like this.

shewin

about 2 hours ago

Why don't you use -ctv q8_0?
Performance-wise it's not any better, so is it because of stability?
