Any way to disable reasoning?

#1
by aoleg - opened

Basically, it works - thank you for this model! But do you know of a way to disable reasoning? It does not seem to accept common tokens like /nothing or /no_think, and I have no idea how to access the template via llama.cpp.

I guess you can set the thinking budget to 0, but you will still get the budget reflection.
Some Jinja template modifications are probably needed so the budget reflection is only used when the budget is > 0; a rough sketch of that follows the example below.

Example response with budget 0:

The current thinking budget is 0, so I will directly start answering the question.</seed:cot_budget_reflect>
</seed:think>Hello! How can I help you today?
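Something along these lines in the template should do it (just a rough sketch; I haven't checked the exact variable and block names in the official chat_template.jinja, so treat this as an illustration rather than a drop-in patch):

{%- if thinking_budget is defined and thinking_budget > 0 -%}
    {#- keep the original budget-reflect instructions from the template inside this block -#}
{%- endif -%}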

I think it's impossible to disable, since its primary use is reasoning.

Yeah, setting the budget to 0 is a nice trick.

So how do you do it, exactly? Do you edit the jinja template, or just prompt?

Got it working in llama-cli by using this format:

llama-cli -m Seed_OSS_36B_Instruct_Q4_K_M.gguf --ctx-size 32768 --n-gpu-layers 99 --temp 1.1 --top-p 0.95 --no-mmap --flash-attn --cache-type-k f16 --cache-type-v f16 --jinja --chat-template-file chat_template.jinja

Downloaded chat_template.jinja from the original model and changed one line:

{%- set thinking_budget = 0 -%}

Works fine in the CLI; for some reason it doesn't work with llama-server, I don't know why. But at least it's something. Also, the model seems to ignore its own rules and outputs the think_end_token without the think_begin_token, so I guess this is one of those cases where a model ships with a borked template. Hopefully unsloth or another team can fix it.

Actually, there is an even simpler way. The jinja template contains the following system message for thinking_budget == 0:

You are an intelligent assistant that can answer questions in one step without the need for reasoning and thinking, that is, your thinking budget is 0. Next, please skip the thinking process and directly start answering the user's questions.

So just adding that to the system prompt disables thinking. Maybe there is an even shorter version, I'll experiment with that.
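With llama-server, the easiest way to try this is to send that text as the system message through the OpenAI-compatible endpoint (sketch below, assuming the server is listening on the default port 8080):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "system", "content": "You are an intelligent assistant that can answer questions in one step without the need for reasoning and thinking, that is, your thinking budget is 0. Next, please skip the thinking process and directly start answering the user'\''s questions."},
    {"role": "user", "content": "Hello"}
  ]
}'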

You can add --chat-template-kwargs '{"thinking_budget": 0}'

Also, the Jinja template is not baked into the model config, so convert-hf-to-gguf.py doesn't retrieve it. You can bake it in after conversion or just use --chat-template-file with the downloaded official template.
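For example, roughly like this (a sketch reusing the settings from the llama-cli command above; the kwargs only take effect with --jinja templating, as far as I can tell):

llama-server -m Seed_OSS_36B_Instruct_Q4_K_M.gguf --ctx-size 32768 --n-gpu-layers 99 --temp 1.1 --top-p 0.95 --flash-attn --jinja --chat-template-file chat_template.jinja --chat-template-kwargs '{"thinking_budget": 0}'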

The chat template is built into the gguf; it works just fine without an external template file. --chat-template-kwargs '{"thinking_budget": 0}' is a good solution, but after looking at the chat template, it seems the only thing it does is set that system prompt I mentioned.

And the final update: koboldcpp just pushed an update, including both thinking and non-thinking chat templates: https://imgur.com/a/LTaWq7t
Both templates work great.

aoleg changed discussion status to closed
