Imitation is the highest form of flattery.
#1 · opened by ubergarm
Oooh sweet! Thanks for doing the IQ6_K!!!

I'd love to see any PPL or KLD or llama-sweep-bench stats if you end up benchmarking!

Cheers!
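For anyone curious how PPL/KLD numbers like these are usually gathered: llama.cpp-style builds ship a llama-perplexity tool that computes perplexity over a text file and KL-divergence against saved base-model logits. A rough sketch, where the model names, wiki.test.raw, and logits.bin are all placeholders:

```bash
# Perplexity on a standard test corpus (paths are placeholders).
./build/bin/llama-perplexity -m IQ6_K.gguf -f wiki.test.raw

# KL-divergence: first dump the full-precision model's logits once...
./build/bin/llama-perplexity -m bf16.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin

# ...then score the quant against them.
./build/bin/llama-perplexity -m IQ6_K.gguf \
  --kl-divergence-base logits.bin --kl-divergence
```

The base file stores the tokens along with the logits, so the second run usually doesn't need -f again.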
Appreciated, but it's me who should be saying thanks for your hard work!
Zero pressure! If anyone decides to try, I've been using a command like this:
# Flag notes:
# -ctk/-ctv: use f16 if the KV cache is on GPU, or q8_0 for CPU (or if you want to trade lower VRAM for slower speed)
# -c: start off with something like 8192 just to see how it works if you want; it gets slower with a deeper KV cache
# -fmoe: fused MoE typically helps a bit by optimizing how some matmul calculations are done
# -amb: helps cap max VRAM usage by re-using the specified buffer; if it's too small it will slow down due to looping, though
# -rtr: repack any tensors going onto CPU into `_R4` to speed up memory/caching/CPU throughput
# -ot: offload the ffn tensors of blocks 14-93 to CPU (a quick way to sanity-check the regexes is sketched below)
# --threads: usually the number of physical cores for smaller non-server rigs; if fully offloaded to GPU, go with 1 thread
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fa \
  -ctk f16 -ctv f16 \
  -c 32768 \
  -fmoe \
  -amb 512 \
  -rtr \
  -ot "blk\.1[4-9]\.ffn.*=CPU" \
  -ot "blk\.[2-8][0-9]\.ffn.*=CPU" \
  -ot "blk\.9[0-3]\.ffn.*=CPU" \
  -ngl 99 \
  --threads 24
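A quick way to sanity-check those -ot regexes, e.g. when adapting them to a model with a different layer count, is to run them over generated tensor names before touching the real command. A minimal sketch; the block count of 93 and the ffn_gate name are just illustrative:

```bash
# Print the block indices whose ffn tensors the three -ot patterns above
# would pin to CPU (everything else goes to the GPU via -ngl 99).
for i in $(seq 0 93); do
  echo "blk.$i.ffn_gate.weight"
done | grep -E 'blk\.(1[4-9]|[2-8][0-9]|9[0-3])\.ffn'
```

As for --threads, `lscpu` reports Socket(s) and Core(s) per socket; multiplied together they give the physical-core count suggested above.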