Qwen/QwQ-32B · 8GB GPU can run this,10t/s

Mar 7

•

https://huggingface.co/mradermacher/QwQ-32B-i1-GGUF

llama-server.exe -m QwQ-32B.i1-IQ2_XXS.gguf -ngl 60 -fa -ctk q8_0 -ctv  q8_0  --temp 0.6 --top-p 0.95 --top-k 30 -c 2048 -n -1  --host 0.0.0.0 --port 8080 --reasoning-format deepseek

speed is about 10t/s

MrDevolver

Mar 7

I guess this is Nvidia using Cuda, right? Certainly not Vulkan, because I have 8GB AMD GPU using Vulkan and I'm getting about 2 t/s at best with Q2_K. I see you're running imatrix version, that's a whole story in itself for me personally they never work well - they never utilize GPU at all and they are slower (probably due to lack of GPU offloading to begin with).

MrDeepSuck

Mar 11

I guess this is Nvidia using Cuda, right? Certainly not Vulkan, because I have 8GB AMD GPU using Vulkan and I'm getting about 2 t/s at best with Q2_K. I see you're running imatrix version, that's a whole story in itself for me personally they never work well - they never utilize GPU at all and they are slower (probably due to lack of GPU offloading to begin with).

Notice! The model he used is QwQ-32B.i1-IQ2_XXS.gguf ! Only 12.9G!
As a contrast, in an 8 GB Nvidia GPU:
speed is 8-10 tps running QwQ-32B.i1-IQ2_XXS.gguf
speed is 3-5 tps running QwQ 32B