IQ2_KS beat unsloth R1T2 UD IQ2_M
I just did some experiments and the results are really good; it's even better than the Qwen 235B UD Q6. Chimera's responses are very similar to Claude Sonnet, with less thinking.
CUDA_VISIBLE_DEVICES="0" LLAMA_ARG_NUMA="numactl" GGML_CUDA_ENABLE_UNIFIED_MEMORY=0 \
numactl --cpunodebind=0 --membind=0 \
./bin/llama-server \
--model "/media/gopinath-s/C2A4E757A4E74D0B1/llama-cpp/models/DeepSeek-TNG-R1T2-Chimera-IQ2_KS-00001-of-00005.gguf" \
--ctx-size 22144 \
-mla 2 -fa -amb 512 -fmoe \
--n-gpu-layers 95 --override-tensor exps=CPU \
-b 2048 -ub 2048 \
--parallel 1 --threads 28 --threads-batch 28 \
--temp 0.7 --min-p 0.05 \
-ser 7,1 --run-time-repack --top-p 0.8 \
--host 127.0.0.1 --port 8080
I used this for the first run; I'll keep tuning to find the sweet spot.
It just fits on my 3060 and dual Xeon 2680 with 256GB RAM, and I'm getting near 5.5 t/s generation with prompt processing at 22 to 36 t/s.
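To double-check how close the load is to the card's 12GB limit, a plain nvidia-smi query is enough (nothing specific to the llama-server setup here; the refresh interval is just an example):

# watch GPU memory while the server loads; model layers + KV cache + compute
# buffers should stay under the 3060's 12GB or allocation will spill/fail
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv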
Thanks for the great report! Glad to hear you are liking the output and are happy with the speed. Impressive you can fit ~20k context on 12GB VRAM!
So given you have a dual-socket Xeon and 256GB total RAM, I'm curious about your configuration. I assume you have 128GB of RAM per socket, exposed as two NUMA nodes in the BIOS?
If your total 256GB of RAM is split across two NUMA nodes, then I'd suggest giving this a try. Instead of pinning to a single CPU socket, which will still have to reach across to the other NUMA node's RAM, try using both CPUs and distributing the model across both NUMA nodes. (Unless you can configure the BIOS for a single NUMA node, like AMD Epyc NPS0, or you actually have 512GB total RAM and only want to use the 256GB on one socket, which is fine.)
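To confirm the topology before changing anything, a quick check with standard tools (nothing specific to the llama-server build) looks like this:

# show NUMA nodes, which CPUs belong to each node, and per-node memory sizes
numactl --hardware
# cross-check with lscpu's NUMA summary
lscpu | grep -i numa

If both nodes show roughly 128GB each, the distribute setup below applies.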
# assuming two numa nodes here for this next line
echo 0 | sudo tee -a /proc/sys/kernel/numa_balancing
# now run with numactl interleaving all NUMA nodes and distribute model across all
CUDA_VISIBLE_DEVICES="0" \
numactl --interleave=all \
./bin/llama-server \
--model "/media/gopinath-s/C2A4E757A4E74D0B1/llama-cpp/models/DeepSeek-TNG-R1T2-Chimera-IQ2_KS-00001-of-00005.gguf" \
--ctx-size 22144 \
-mla 3 \
-fa -amb 256 -fmoe \
-ngl 99 \
-ot exps=CPU \
-b 2048 -ub 2048 \
--parallel 1 \
--threads 42 \
--threads-batch 56 \
--numa distribute \
--temp 0.7 \
--min-p 0.05 \
--top-p 0.8 \
-ser 7,1 \
--run-time-repack \
--host 127.0.0.1 \
--port 8080
You might have enough VRAM available to increase to -ub 3072 -b 3072 after dropping down to -amb 256, maybe. I wouldn't recommend going lower than -amb 256 though. You can also try without --run-time-repack, as the non-_r4 quant types are quite fast with large enough batch sizes on MoEs. In general I just use -mla 3, but of course you have to llama-sweep-bench test everything to be sure how it affects your specific rig (see the sketch below).
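A sweep-bench run for that could look something like this; it's a minimal sketch mirroring the server flags above, and exact supported options can vary by build, so adjust to whatever your binary accepts:

# measure PP and TG speed across context depths; rerun with/without
# --run-time-repack and with different -b/-ub/-amb values to compare
numactl --interleave=all \
./bin/llama-sweep-bench \
--model "/media/gopinath-s/C2A4E757A4E74D0B1/llama-cpp/models/DeepSeek-TNG-R1T2-Chimera-IQ2_KS-00001-of-00005.gguf" \
-c 22144 -mla 3 -fa -amb 256 -fmoe \
-ngl 99 -ot exps=CPU \
-b 2048 -ub 2048 \
--threads 42 --threads-batch 56 \
--numa distribute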
This setup will use the cores on both CPUs for PP and will likely boost you quite a bit there, but TG speed will probably stay about the same given cross-NUMA RAM bandwidth challenges.
CUDA_VISIBLE_DEVICES="0" LLAMA_ARG_NUMA="numactl" GGML_CUDA_ENABLE_UNIFIED_MEMORY=0 \
numactl --cpunodebind=0 --membind=0 \
./bin/llama-server \
--model "/media/gopinath-s/C2A4E757A4E74D0B1/llama-cpp/models/DeepSeek-TNG-R1T2-Chimera-IQ2_KS-00001-of-00005.gguf" \
--ctx-size 102144 \
-mla 2 -fa -amb 512 -fmoe \
--n-gpu-layers 95 --override-tensor exps=CPU \
-b 2048 -ub 2048 \
--parallel 1 --threads 28 --threads-batch 28 \
--temp 0.7 --min-p 0.05 \
-ser 4,1 --run-time-repack --top-p 0.8 \
--host 127.0.0.1 --port 8080
(With -ser adjusted, speed and response both improved; minimum -ser 4,1 and maximum 7,1.)
I just adjusted ser and am getting around 7.5 to 8 t/s, with prompt processing at 45 t/s. The responses are really good; tbh I have never seen any model this fast and accurate, we need to run benchmarks on this. I previously tried NUMA nodes 0, 2, and 4, but 0 did best; I will try with 2 NUMA nodes. My system is a dual Xeon E5-2680 v4 with 256GB DDR4 RAM, a 1TB hard disk, and a 3060 with 12GB VRAM. I will try the settings recommended above.