Thank you!
My fav chef is busy in the kitchen again. Just wanted to express my gratitude!
Thank you!
The IQ2_KL just landed, and bigger quants are coming soon. I'll likely re-use the recipes from the Instruct version for now and revisit later if there are any special requests or I find better recipes. Cheers and thanks for the encouragement! Have fun with all your GPUs haha
Okie, finished with the initial quants, and the perplexity testing graph is up! Choose your own adventure!
Thanks, IQ2 quant is a good size to try on four MI50s.
Qwen made this week very busy! :)
Guys, why is this model thinking continuously? Does anyone know where I should correct it?
Are you using llama-server and the chat endpoint, or are you using the text completions endpoint? My first guess would be to confirm your client is using the correct template or endpoint. Otherwise there may be a way to set the probability of the </think> token higher somehow to try to "reduce" thinking? hah..
Also how much is "continuously"? From the Qwen3 model card: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.
For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
So it might naturally be a yappy thinker.
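If it helps, here is a minimal sketch of hitting llama-server's OpenAI-compatible chat endpoint with the model card's suggestions applied (host/port match the command below; the presence_penalty and max_tokens field names are the standard llama.cpp OAI-compat ones, which ik_llama.cpp inherits, so treat this as illustrative rather than exact):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain the birthday paradox briefly."}],
        "max_tokens": 32768,
        "presence_penalty": 1.0
      }'

If thinking still runs away over the chat endpoint, it is more likely the model just being a yappy thinker than a template issue.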
CUDA_VISIBLE_DEVICES="" ./bin/llama-server --model "/home/gopi/Qwen3-235B-A22B-Thinking-2507-IQ5_K-00001-of-00004.gguf" --ctx-size 81920 -fa -amb 512 -fmoe --n-gpu-layers 0 -b 200 -ub 200 --parallel 1 --threads 52 --threads-batch 52 --temp 0.7 --min-p 0.01 --run-time-repack --top-p 0.8 --host 127.0.0.1 --port 8080
I am just running it like this and it's just thinking and thinking lol ;p
Update: I increased the presence penalty from 0 to 1 and now it's giving me a proper answer.
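(Side note, in case it helps others: the same thing can probably be set as a server-side default at launch rather than per request, assuming this ik_llama.cpp build still exposes the standard llama.cpp sampling flag, by adding something like the following to the llama-server command:

--presence-penalty 1.0

Per-request values sent by the client should still override it.)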
Great, glad the original Qwen3 model card had the answer!
P.S. No need for -amb with Qwen3; that is only for DeepSeek/Kimi-K2. Also it is strange that you use -ub 200 -b 200; there might be a speed boost from using a power of two, or just leave it at the default if you don't want to increase it. Typically I only use (see the example command after this list):
- default (-ub 512 -b 2048)
- -ub 1024
- -ub 2048
- -ub 4096 -b 4096
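For example, taking the earlier command, dropping -amb, and switching to power-of-two batch sizes would look something like this (just a sketch; everything else is copied from the original invocation above):

CUDA_VISIBLE_DEVICES="" ./bin/llama-server --model "/home/gopi/Qwen3-235B-A22B-Thinking-2507-IQ5_K-00001-of-00004.gguf" --ctx-size 81920 -fa -fmoe --n-gpu-layers 0 -ub 2048 -b 2048 --parallel 1 --threads 52 --threads-batch 52 --temp 0.7 --min-p 0.01 --top-p 0.8 --run-time-repack --host 127.0.0.1 --port 8080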
Yep, I am running it like this now and getting 4.5 t/s, and the response accuracy has also increased.
CUDA_VISIBLE_DEVICES="0" ./bin/llama-server --model "/home/gopi/Qwen3-235B-A22B-Thinking-2507-IQ5_K-00001-of-00004.gguf" --ctx-size 20764 -fa -amb 512 -fmoe -ser 7,1 --n-gpu-layers 95 --override-tensor exps=CPU --parallel 1 --threads 28 --threads-batch 28 --run-time-repack --host 127.0.0.1 --port 8080 (pp 0.5)