Imitation is the highest form of flattery.
#1 · opened by ubergarm
Oooh sweet! Thanks for doing the IQ6_K!!!

I'd love to see any PPL or KLD or llama-sweep-bench stats if you end up benchmarking!

Cheers!
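For anyone curious how PPL/KLD numbers like these are usually gathered: llama.cpp-style builds ship a llama-perplexity tool that computes perplexity over a text file and KL-divergence against saved base-model logits. A rough sketch, where the model names, wiki.test.raw, and logits.bin are all placeholders:

```bash
# Perplexity on a standard test corpus (paths are placeholders).
./build/bin/llama-perplexity -m IQ6_K.gguf -f wiki.test.raw

# KL-divergence: first dump the full-precision model's logits once...
./build/bin/llama-perplexity -m bf16.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin

# ...then score the quant against them.
./build/bin/llama-perplexity -m IQ6_K.gguf \
  --kl-divergence-base logits.bin --kl-divergence
```

The base file stores the tokens along with the logits, so the second run usually doesn't need -f again.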
Appreciated, but it's me who should be saying thanks for your hard work!
Zero pressure! If anyone decides to try, I've been using a command like this:
# Flag notes:
# -ctk/-ctv: use f16 if the KV cache is on GPU, or q8_0 for CPU (or if you want to trade lower VRAM for slower speed)
# -c: start off with something like 8192 just to see how it works if you want; it gets slower with a deeper KV cache
# -fmoe: fused MoE typically helps a bit by optimizing how some matmul calculations are done
# -amb: helps cap max VRAM usage by re-using the specified buffer; if it's too small it will slow down due to looping, though
# -rtr: repack any tensors going onto CPU into `_R4` to speed up memory/caching/CPU throughput
# -ot: offload the ffn tensors of blocks 14-93 to CPU (a quick way to sanity-check the regexes is sketched below)
# --threads: usually the number of physical cores for smaller non-server rigs; if fully offloaded to GPU, go with 1 thread
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fa \
  -ctk f16 -ctv f16 \
  -c 32768 \
  -fmoe \
  -amb 512 \
  -rtr \
  -ot "blk\.1[4-9]\.ffn.*=CPU" \
  -ot "blk\.[2-8][0-9]\.ffn.*=CPU" \
  -ot "blk\.9[0-3]\.ffn.*=CPU" \
  -ngl 99 \
  --threads 24
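A quick way to sanity-check those -ot regexes, e.g. when adapting them to a model with a different layer count, is to run them over generated tensor names before touching the real command. A minimal sketch; the block count of 93 and the ffn_gate name are just illustrative:

```bash
# Print the block indices whose ffn tensors the three -ot patterns above
# would pin to CPU (everything else goes to the GPU via -ngl 99).
for i in $(seq 0 93); do
  echo "blk.$i.ffn_gate.weight"
done | grep -E 'blk\.(1[4-9]|[2-8][0-9]|9[0-3])\.ffn'
```

As for --threads, `lscpu` reports Socket(s) and Core(s) per socket; multiplied together they give the physical-core count suggested above.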