Thank you!
My fav chef is busy in the kitchen again. Just wanted to express my gratitude!
Thank you!
The IQ2_KL just landed, and bigger quants are coming soon. I'll likely re-use the recipes from the Instruct version for now and revisit later if there are any special requests or I find better recipes. Cheers and thanks for the encouragement! Have fun with all your GPUs haha
Okie, finished with the initial quants, and the perplexity testing graph is up! Choose your own adventure!
Thanks, IQ2 quant is a good size to try on four MI50s.
Qwen made this week very busy! :)
Guys, why is this model thinking continuously? Does anyone know where I should correct it?
Are you using llama-server and the chat endpoint, or are you using the text completions endpoint? My first guess would be to confirm your client is using the correct template or endpoint. Otherwise there may be a way to set the probability of the </think> token higher somehow to try to "reduce" thinking? hah..
Also how much is "continuously"? From the Qwen3 model card: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.
For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
So it might naturally be a yappy thinker.
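If it helps, here is a minimal sketch of hitting llama-server's OpenAI-compatible chat endpoint with the model card's suggestions applied (host/port match the command below; the presence_penalty and max_tokens field names are the standard llama.cpp OAI-compat ones, which ik_llama.cpp inherits, so treat this as illustrative rather than exact):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain the birthday paradox briefly."}],
        "max_tokens": 32768,
        "presence_penalty": 1.0
      }'

If thinking still runs away over the chat endpoint, it is more likely the model just being a yappy thinker than a template issue.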
CUDA_VISIBLE_DEVICES="" ./bin/llama-server --model "/home/gopi/Qwen3-235B-A22B-Thinking-2507-IQ5_K-00001-of-00004.gguf" --ctx-size 81920 -fa -amb 512 -fmoe --n-gpu-layers 0 -b 200 -ub 200 --parallel 1 --threads 52 --threads-batch 52 --temp 0.7 --min-p 0.01 --run-time-repack --top-p 0.8 --host 127.0.0.1 --port 8080
I am just running it like this and it's just thinking and thinking lol ;p
Update: I increased the presence penalty from 0 to 1 and now it's giving me a proper answer.
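(Side note, in case it helps others: the same thing can probably be set as a server-side default at launch rather than per request, assuming this ik_llama.cpp build still exposes the standard llama.cpp sampling flag, by adding something like the following to the llama-server command:

--presence-penalty 1.0

Per-request values sent by the client should still override it.)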
Great, glad the original Qwen3 model card had the answer!
P.S. No need for -amb with Qwen3; that is only for DeepSeek/Kimi-K2. Also it is strange that you use -ub 200 -b 200; there might be a speed boost from using a power of two, or just leave it at the default if you don't want to increase it. Typically I only use (see the example command after this list):
- default (-ub 512 -b 2048)
- -ub 1024
- -ub 2048
- -ub 4096 -b 4096
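For example, taking the earlier command, dropping -amb, and switching to power-of-two batch sizes would look something like this (just a sketch; everything else is copied from the original invocation above):

CUDA_VISIBLE_DEVICES="" ./bin/llama-server --model "/home/gopi/Qwen3-235B-A22B-Thinking-2507-IQ5_K-00001-of-00004.gguf" --ctx-size 81920 -fa -fmoe --n-gpu-layers 0 -ub 2048 -b 2048 --parallel 1 --threads 52 --threads-batch 52 --temp 0.7 --min-p 0.01 --top-p 0.8 --run-time-repack --host 127.0.0.1 --port 8080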
Yep, I am running it like this now and getting 4.5 t/s, and the response accuracy has also increased.
CUDA_VISIBLE_DEVICES="0" ./bin/llama-server --model "/home/gopi/Qwen3-235B-A22B-Thinking-2507-IQ5_K-00001-of-00004.gguf" --ctx-size 20764 -fa -amb 512 -fmoe -ser 7,1 --n-gpu-layers 95 --override-tensor exps=CPU --parallel 1 --threads 28 --threads-batch 28 --run-time-repack --host 127.0.0.1 --port 8080 (pp 0.5)