Tokenizer or template bug

#1
by beijinghouse - opened

Q4_K_XL quant of MedGemma-27B-text-it GGUF in llama.cpp

initial response begins:

<unused94>thought
Here's a breakdown of the thinking process...

Unsloth AI org


Hi there, I tried it in llama.cpp and the error doesn't occur. Do you know if it's specific to the Q4_K_XL quant?

Yes 100% Unsloth Q4 XL

llama.cpp b5423, default sampler settings. It occurred on the very first attempt to use the model, so I assumed it would be easy to reproduce. The prompt was something like "describe all medications that can be used to treat X".
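For reference, roughly the same repro through the llama-cpp-python bindings (just a sketch: the GGUF filename below is a placeholder and sampling is left at defaults):

```python
# Sketch of a repro via the llama-cpp-python bindings rather than the llama.cpp CLI.
# The model path is a placeholder; point it at the actual Unsloth Q4_K_XL GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="medgemma-27b-text-it-Q4_K_XL.gguf",  # placeholder filename
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Describe all medications that can be used to treat X."}
    ],
    max_tokens=512,
)

# Check whether the reply starts with the <unused94>thought marker.
print(out["choices"][0]["message"]["content"])
```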

I am also experiencing this issue using Q8_K_XL with Ollama. Not sure how to fix it, or if it is just a weird quirk of the model itself. This model feels like they did SFT over reasoning traces, tbh.


Same problem Q4_K_XL:

thought
Thinking Process:
....

Any solution?

It seems the 27b model was trained to think; the trigger is a specific system prompt (although it's not always reliable in my testing).
Here is an interesting bit from Google's notebook on this model: https://github.com/Google-Health/medgemma/blob/main/notebooks/quick_start_with_hugging_face.ipynb

```python
from IPython.display import Markdown

prompt = "How do you differentiate bacterial from viral pneumonia?"  # @param {type: "string"}

role_instruction = "You are a helpful medical assistant."
# model_variant and is_thinking are set earlier in the notebook
if "27b" in model_variant and is_thinking:
    system_instruction = f"SYSTEM INSTRUCTION: think silently if needed. {role_instruction}"
    max_new_tokens = 1500
else:
    system_instruction = role_instruction
    max_new_tokens = 500

messages = [
    {
        "role": "system",
        "content": system_instruction
    },
    {
        "role": "user",
        "content": prompt
    }
]
```
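For context, a rough sketch of how those messages then get run with the transformers pipeline (the model id and generation settings here are my assumptions, not copied from the notebook):

```python
# Sketch only: feeding the messages built above through the transformers pipeline API.
# Model id, dtype, and device settings are assumptions, not from the notebook excerpt.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/medgemma-27b-text-it",  # assumed HF repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = pipe(messages, max_new_tokens=max_new_tokens)

# With the "think silently" system prompt on the 27b variant, expect a thought block here.
print(output[0]["generated_text"][-1]["content"])
```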


Nice find. Thanks!

Having the same issue: the response includes the thought block, and it takes a lot of tokens.

For example:
Before cleaning, the agent produced 8129 characters of response.
After cleaning, the agent produced 3953 characters of response.

I explicitly instructed the model not to think, and that worked most of the time, but as the conversation gets longer or the task gets more complicated it starts to think again. My solution for now is to filter out the thought between the special tokens, but sometimes the model doesn't close the block or stops halfway through the thinking process.
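For anyone who wants to do the same filtering, here's a minimal sketch. The opening marker is the <unused94> seen in the outputs above; the closing marker below is a guess, so swap in whatever your build actually emits:

```python
# Guessed markers: the opener matches the <unused94>thought seen in the outputs above;
# the closer is an assumption - replace it with whatever token your build emits.
THOUGHT_OPEN = "<unused94>"
THOUGHT_CLOSE = "<unused95>"

def strip_thought(text: str) -> str:
    """Remove the thought block; if it was never closed, drop everything after the opener."""
    start = text.find(THOUGHT_OPEN)
    if start == -1:
        return text  # no thinking in this response
    end = text.find(THOUGHT_CLOSE, start)
    if end == -1:
        return text[:start].strip()  # model stopped mid-thought, keep only what came before
    return (text[:start] + text[end + len(THOUGHT_CLOSE):]).strip()
```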

Now if I detect a thought, I reject the response and tell it to stop thinking and try again. This is tedious, but I really like this model's output so far as far as medical knowledge goes.
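The retry loop is roughly this (generate() is a hypothetical stand-in for whatever chat call you're using, and it reuses the markers and strip_thought() from the sketch above):

```python
# Sketch of the detect-and-retry loop; generate() is a hypothetical stand-in
# that takes a list of chat messages and returns the assistant's text.
def answer_without_thinking(messages, max_retries=3):
    reply = generate(messages)
    for _ in range(max_retries):
        if THOUGHT_OPEN not in reply:
            return reply
        # Reject the response and ask again without the thinking block.
        messages = messages + [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": "Do not think out loud. Answer directly, without any thought block."},
        ]
        reply = generate(messages)
    return strip_thought(reply)  # fall back to filtering if it keeps thinking
```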

The thought process isn't really worth much anyway, because the model repeats the same content in the final response, so there isn't much value in it.
