Add reasoning capabilities for Gemma 3

#66
by devopsML - opened

Since Google's own proprietary AI model, Gemini 2.5 Pro, already has reasoning capabilities by default, wouldn't it be better for the devs to add the same capability (optionally) to Gemma 3 or its future release, Gemma 4?

We hope Gemma 3 will be upgraded with high-performance reasoning capabilities!

Hi @devopsML ,

Gemma 3 does have reasoning capabilities. It’s built to handle complex tasks using both text and visual inputs. With multimodal reasoning and a large 128K-token context window, it can process information from various sources and respond with strong contextual understanding. Please refer to this blog for more details.
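For anyone who wants to try the multimodal side locally, a rough sketch with the `transformers` image-text-to-text pipeline might look like the following (the image URL and prompt are placeholders, and this assumes a recent transformers release with Gemma 3 support and access to the gated checkpoint):

```python
# Minimal sketch (not an official example): sending an image plus text to a
# Gemma 3 instruction-tuned checkpoint via the transformers pipeline.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",           # multimodal chat pipeline
    model="google/gemma-3-27b-it",  # instruction-tuned Gemma 3 checkpoint
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},  # placeholder URL
            {"type": "text", "text": "What trend does this chart show, and why?"},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=256)
# With chat-style input, generated_text holds the conversation; the last turn
# is the model's reply.
print(out[0]["generated_text"][-1]["content"])
```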

Pro models like Gemini 2.5 Pro offer more advanced features because they have far larger parameter counts than open models like Gemma 3. As a result, their performance is typically stronger. That said, efforts are ongoing to bring more advanced features to open models as well.

Thank you.

[image.png: screenshot of Gemma 3 27B-it on HuggingChat answering instantly, without a visible reasoning step]

We do not think so, because if you look closely at this image you will see that Gemma 3 (27B-it) answers instantly, without any reasoning at all. It should have reasoning capabilities on HuggingChat, either by default or via /think and /nothink commands, like Qwen 3.
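For reference, this is roughly how the Qwen3-style toggle works when the model is run locally with `transformers` (a sketch only: the `enable_thinking` flag belongs to Qwen3's chat template, and Gemma 3's template currently has no equivalent, which is exactly the gap being requested):

```python
# Sketch of the Qwen3-style thinking toggle referred to above.
# Assumes a recent transformers release and the Qwen/Qwen3-8B checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many prime numbers are below 30?"}]

# enable_thinking=True makes the model emit a <think>...</think> block before
# the answer; False skips it, mirroring the /think and /nothink chat commands.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```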

We hope you will reconsider your position and make improvements to Gemma 3.

Thanks for the feedback. We will definitely consider this for future improvements to Gemma 3.

@devopsML Reasoning can be beneficial, but is over-hyped, especially when it comes to smaller models.

The "reasoning" tokens basically just minimize stupid mistakes (e.g. undeclared variables and arithmetic errors) by maximizing the retrieval accuracy of trained "knowledge". This is why reasoning models like Qwen3 can only solve math, coding, and other problems that humans have already solved, and not a single original or unsolved problem.

So small thinking models aren't even beginning to solve more complex or original problems than their non-thinking counterparts. They primarily see bumps in math, coding, and other problem-solving test scores through the reduction of stupid blunders that a reasonably competent human user can either fix on their own or correct with a follow-up prompt.

This partially explains why, on LMSYS, the relatively small non-thinking Gemma 3 27b performs on par with the much larger Qwen3 235b with thinking enabled. Another big reason is likely that Qwen3 is PROFOUNDLY ignorant for its size. Even its largest 235b model scores lower on my broad knowledge test than Mistral Small 22b, Cohere 34b, Gemma 3 27b, and even Gemma 2 9b. And it scores below 11 on the English SimpleQA, with most of those 11 points coming from the STEM set of questions.

Point being, Gemma 3 27b did something right and is undeniably the most generally capable and powerful small LLM in existence, so if it ain't broke, why fix it? Plus it somehow manages to do a good job "thinking" without using pre-thinking tokens, such as creating quality original stories and poems that respect the user's prompt directives, which thinking models like Qwen3 can't do.

That is, Gemma 3 27b's poems are VASTLY superior to Qwen3's, even with thinking enabled: they have more depth, eloquence, and rhyme & meter adherence, and fewer contradictions of the user's prompt. When it comes to stories, Qwen3 can be very good. However, when prompted to write an original story (e.g. with a list of inclusions and exclusions), the prompt contradictions go up as the story quality goes down. For example, it will either ignore a directive or repeatedly and heavy-handedly bring focus to it, breaking the organic flow of the story, and the story's eloquence and self-consistency plummet. In contrast, Gemma 3 27b can reliably respect the user's directives to write an original story with a much smaller drop in writing quality. So it somehow manages to "think" better than Qwen3 across numerous tasks despite not using pre-thinking tokens.

Gemma 3 27b's biggest weakness isn't the lack of pre-"thinking" tokens, but its high hallucination rate when it comes to very popular knowledge. If Google can somehow find a way to seamlessly fuse the weights with a small relational database of popular core knowledge, then user confidence in its outputs would be drastically improved. It only takes ~100 core words per popular movie, show, album, game... to drastically reduce factual hallucinations and seed more accurate recollection of related information. For example, when I pasted the main cast of popular movies/shows and album song lists into Qwen3's thinking tokens, its factual errors about other information in said show, movie, album, game... went down. So including a small skeleton database of core facts about popular things, one much smaller than the LLM itself, can drastically reduce its factual hallucination rate.
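As a rough illustration of that idea (purely hypothetical: the `CORE_FACTS` table and `grounded_prompt` helper are made-up names, not an existing feature of Gemma or any library), prepending a tiny table of core facts to the prompt might look like:

```python
# Hypothetical sketch of the "skeleton database" idea above: look up a few
# core facts about the entity the user mentions and prepend them to the
# prompt, so the model recalls related details more accurately.
CORE_FACTS = {
    "the matrix": (
        "The Matrix (1999), directed by the Wachowskis. Main cast: "
        "Keanu Reeves (Neo), Laurence Fishburne (Morpheus), "
        "Carrie-Anne Moss (Trinity), Hugo Weaving (Agent Smith)."
    ),
    # ... on the order of ~100 core words per popular movie/show/album/game
}

def grounded_prompt(user_question: str) -> str:
    """Prepend any matching core facts to the user's question."""
    facts = [v for k, v in CORE_FACTS.items() if k in user_question.lower()]
    if not facts:
        return user_question
    return "Known facts:\n" + "\n".join(facts) + "\n\nQuestion: " + user_question

# The grounded prompt would then be sent to the model instead of the raw question.
print(grounded_prompt("Who played the main villain in The Matrix?"))
```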
