Tokenizer or template bug

#1
by beijinghouse - opened

Q4_K_XL quant of MedGemma-27B-text-it GGUF in llama.cpp

initial response begins:

<unused94>thought
Here's a breakdown of the thinking process...

Unsloth AI org


Hi there, I tried it in llama.cpp and the error doesn't occur. Do you know if it's specific to the Q4_K_XL quant?

Yes 100% Unsloth Q4 XL

llama.cpp b5423, default sampler settings. It occurred on the very first attempt to use the model, so I assumed it would be easy to reproduce. The prompt was something like "describe all medications that can be used to treat X".
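For reference, roughly the same repro through the llama-cpp-python bindings (just a sketch: the GGUF filename below is a placeholder and sampling is left at defaults):

```python
# Sketch of a repro via the llama-cpp-python bindings rather than the llama.cpp CLI.
# The model path is a placeholder; point it at the actual Unsloth Q4_K_XL GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="medgemma-27b-text-it-Q4_K_XL.gguf",  # placeholder filename
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Describe all medications that can be used to treat X."}
    ],
    max_tokens=512,
)

# Check whether the reply starts with the <unused94>thought marker.
print(out["choices"][0]["message"]["content"])
```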

I am also experiencing this issue using Q8_K_XL with Ollama. Not sure how to fix it, or if it is just a weird quirk of the model itself. This model feels like they did SFT over reasoning traces, tbh.


Same problem Q4_K_XL:

thought
Thinking Process:
....

Any solution?

It seems the 27b model was trained to think; the trigger is a specific system prompt (although it's not always reliable in my testing).
Here is an interesting bit from Google's notebook on this model: https://github.com/Google-Health/medgemma/blob/main/notebooks/quick_start_with_hugging_face.ipynb

```python
from IPython.display import Markdown

prompt = "How do you differentiate bacterial from viral pneumonia?"  # @param {type: "string"}

role_instruction = "You are a helpful medical assistant."
# model_variant and is_thinking are set earlier in the notebook
if "27b" in model_variant and is_thinking:
    system_instruction = f"SYSTEM INSTRUCTION: think silently if needed. {role_instruction}"
    max_new_tokens = 1500
else:
    system_instruction = role_instruction
    max_new_tokens = 500

messages = [
    {
        "role": "system",
        "content": system_instruction
    },
    {
        "role": "user",
        "content": prompt
    }
]
```
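For context, a rough sketch of how those messages then get run with the transformers pipeline (the model id and generation settings here are my assumptions, not copied from the notebook):

```python
# Sketch only: feeding the messages built above through the transformers pipeline API.
# Model id, dtype, and device settings are assumptions, not from the notebook excerpt.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/medgemma-27b-text-it",  # assumed HF repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = pipe(messages, max_new_tokens=max_new_tokens)

# With the "think silently" system prompt on the 27b variant, expect a thought block here.
print(output[0]["generated_text"][-1]["content"])
```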


Nice find. Thanks!

Having the same issue: the response includes the thought block, and it takes a lot of tokens.

For example:
Before cleaning, the agent produced 8129 characters of response.
After cleaning, the agent produced 3953 characters of response.

I explicitly instructed the model not to think, and that worked most of the time, but as the conversation gets longer or the task gets more complicated it starts to think again. My solution for now is to filter out the thought between the special tokens, but sometimes the model doesn't close the block or stops halfway through the thinking process.
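For anyone who wants to do the same filtering, here's a minimal sketch. The opening marker is the <unused94> seen in the outputs above; the closing marker below is a guess, so swap in whatever your build actually emits:

```python
# Guessed markers: the opener matches the <unused94>thought seen in the outputs above;
# the closer is an assumption - replace it with whatever token your build emits.
THOUGHT_OPEN = "<unused94>"
THOUGHT_CLOSE = "<unused95>"

def strip_thought(text: str) -> str:
    """Remove the thought block; if it was never closed, drop everything after the opener."""
    start = text.find(THOUGHT_OPEN)
    if start == -1:
        return text  # no thinking in this response
    end = text.find(THOUGHT_CLOSE, start)
    if end == -1:
        return text[:start].strip()  # model stopped mid-thought, keep only what came before
    return (text[:start] + text[end + len(THOUGHT_CLOSE):]).strip()
```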

Now if I detect a thought, I reject the response and tell it to stop thinking and try again. This is tedious, but I really like this model's output so far as far as medical knowledge goes.
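The retry loop is roughly this (generate() is a hypothetical stand-in for whatever chat call you're using, and it reuses the markers and strip_thought() from the sketch above):

```python
# Sketch of the detect-and-retry loop; generate() is a hypothetical stand-in
# that takes a list of chat messages and returns the assistant's text.
def answer_without_thinking(messages, max_retries=3):
    reply = generate(messages)
    for _ in range(max_retries):
        if THOUGHT_OPEN not in reply:
            return reply
        # Reject the response and ask again without the thinking block.
        messages = messages + [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": "Do not think out loud. Answer directly, without any thought block."},
        ]
        reply = generate(messages)
    return strip_thought(reply)  # fall back to filtering if it keeps thinking
```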

The thought process isn't really worth much anyway, because the model repeats the same content in the final response, so there isn't much value in it.
