Does not support multimodal input

#5
by RamboRogers - opened

Is there a way to make this multimodal like Gemma3 is?

Unsloth AI org

> Is there a way to make this multimodal like Gemma3 is?

Our upload does support multimodal. What are you using this on?

> Is there a way to make this multimodal like Gemma3 is?

+1

When trying to use the following conversation structure on unsloth/gemma-3-27b-it-unsloth-bnb-4bit:

```python
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
```

The model returns the following:

["user\nYou are a helpful assistant.\n\nDescribe this image.\nmodel\nOkay, let's describe the image!\n\nThe image shows a cozy and inviting living room scene. Here's a breakdown of what I see:\n\n* **Setting:** It appears to be a living room, likely in a home. The style is warm and inviting, with a focus on comfort.\n"]

The actual content of the image is completely different. I also tried giving different URLs as input, but the answer is identical ("The image shows a cozy and inviting...").
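For what it's worth: the decoded prompt above contains no image tokens at all, which suggests the chat template was rendered as text only, so the image never actually reached the model and it hallucinated a description. Passing the messages through the processor with `tokenize=True` avoids that. A minimal sketch, assuming a recent transformers release with Gemma 3 support (`Gemma3ForConditionalGeneration` plus `AutoProcessor`); untested against the 4-bit upload specifically:

```python
# Minimal multimodal sketch for Gemma 3, assuming a recent transformers
# release with Gemma3ForConditionalGeneration; untested on the 4-bit upload.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "unsloth/gemma-3-27b-it-unsloth-bnb-4bit"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

# tokenize=True / return_dict=True makes the processor fetch the image and
# insert the vision tokens; tokenizing the rendered template as plain text
# silently drops the image.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)

input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```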

Unsloth AI org

> Is there a way to make this multimodal like Gemma3 is?
>
> +1

Where are you using this?

Maybe it's something that needs to be set in Ollama?

For Ollama, I'm unsure exactly what the issue is, but I asked the Ollama folks. They have a unique way of doing GGUFs: they integrate the mmproj into the actual file, so unfortunately there's nothing we can do about it if it still doesn't work. Hopefully they will support separate mmproj files in the future :(

It works in other places like llama.cpp, etc.
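If anyone wants to verify outside Ollama, here's a hypothetical invocation of llama.cpp's multimodal CLI with the mmproj kept as a separate file; the binary name varies by build (`llama-gemma3-cli` in older ones, `llama-mtmd-cli` in newer ones), and the GGUF filenames below are placeholders for whatever you downloaded:

```python
# Hypothetical example: calling llama.cpp's multimodal CLI with the mmproj
# as a separate file (unlike Ollama, which bakes it into the GGUF).
# Binary and file names are placeholders for your local build/downloads.
import subprocess

subprocess.run(
    [
        "./llama-mtmd-cli",                        # llama-gemma3-cli in older builds
        "-m", "gemma-3-27b-it-Q4_K_M.gguf",        # main model weights (placeholder)
        "--mmproj", "mmproj-gemma-3-27b-it.gguf",  # separate vision projector (placeholder)
        "--image", "bee.jpg",
        "-p", "Describe this image.",
    ],
    check=True,
)
```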

CC: @kallortz @RamboRogers
