Qwen has been losing broad knowledge since Qwen2.
LLMs gained broad knowledge with each generation until around the release of Qwen2 and Llama 3; after that, they started to hallucinate progressively more across all popular domains of knowledge.
This correlated with rising math and coding scores, and also with native multimodal support (Llama 4). Apparently, overtraining models to boost performance on select tasks like coding, math, and multimodality is scrambling their weights.
For example, Qwen2 72b gave a perfect answer when asked about the cast of the most popular Canadian TV show (number one every year it aired). Yet Qwen3, despite being much larger, filled its response with hallucinations (pasted below). This lack of accurate knowledge can't be rectified by RAG because it floods organic tasks like story writing and conversation with contradictions and absurdities.
Again, this isn't unique to Qwen3. All newer open source models (e.g. Llama 4, Phi4, and the versions of Mistral Small since 2409) are experiencing a HUGE spike in hallucinations across all popular domains of knowledge, including movies, music, TV, sports, games, and literature, while their math, science, and coding knowledge is only moderately improved.
That is, they aren't really improving. They're just trading general knowledge and abilities for gains on select domains. It's time to start creating external math and coding agents that are called on by a master general-purpose LLM, instead of obsessively training general-purpose LLMs on trillions of math and coding tokens until all the previously trained weights get scrambled.
Qwen2 72b (perfect)
Brent Leroy (Brent Butt)
Lacey Burrows (Gabrielle Miller)
Oscar Leroy (Eric Peterson)
Emma Leroy (Janet Wright)
Davis Quinton (Lorne Cardinal)
Karen Pelly (Tara Spencer-Nairn)
Corner Gas first aired in 2004.
Qwen3 235b (all scrambled)
Brent Larssen – Brent Butt
Emma Hunter – Nancy Robertson
Officer Karen Pelly – Janet Wright
Lorne Karns – Fred Ewanuick
Wanda Dollard – Gabrielle Miller
Mayor Oscar Nahatassin – Eric Peterson
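To make the agent suggestion concrete, here's a minimal sketch of a master model routing requests to specialist models. It assumes an OpenAI-compatible local endpoint (e.g. llama.cpp server or vLLM); the URL and model names are placeholders, not real releases.

```python
import requests

# Sketch only: a general-purpose "master" model routes each request, and
# narrow math/coding skill lives in separate specialist models.
# API_URL and the model names below are assumptions for illustration.
API_URL = "http://localhost:8000/v1/chat/completions"

SPECIALISTS = {
    "math": "math-specialist",
    "code": "coding-specialist",
    "general": "broad-knowledge-generalist",
}

def call_model(model: str, prompt: str) -> str:
    # Plain OpenAI-style chat completion request to a local server.
    resp = requests.post(API_URL, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def answer(request: str) -> str:
    # The generalist only classifies the request (and handles broad-knowledge
    # questions itself); math and code go to the specialists.
    router_prompt = (
        "Classify the following request with exactly one word: math, code, or general.\n\n"
        f"Request: {request}"
    )
    domain = call_model(SPECIALISTS["general"], router_prompt).strip().lower()
    return call_model(SPECIALISTS.get(domain, SPECIALISTS["general"]), request)
```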
They're just trading general knowledge and abilities for gains on select domains.
So just like people, they become experts in a field. I don't have a problem with that personally.
@ZiggyS Yes, powerful coding and math AI models are useful, so I also don't have a problem with them becoming experts in those fields.
But would you put a coding nerd at the receptionist desk? A general-purpose AI model that's usable by the general English-speaking population can't just be good at math, coding, and science, and horrible at everything else. Qwen3 is basically just a coding and math tool masquerading as an AI model, and it hallucinates like crazy across all the most popular domains of human knowledge.
I should say that the primary function of a language model is language, not coding and math. Organic tasks like story writing and conversation are suitable measures of language intelligence.
But would you put a coding nerd at the receptionist desk?
No. Nor would I put a general-purpose one there either. I'd want to train it on my business and the things it needs to talk about with customers. So, targeted.
It's almost as if we would need a fat knowledge expert spitting facts into <knowledge></knowledge> right before <think></think>.
It's almost as if we would need a fat knowledge expert spitting facts into <knowledge></knowledge> right before <think></think>.
Wouldn't be a bad idea. But can't that be done with RAG now?
@ZiggyS RAG is great for simple question answering, but even then a quick internet search is more reliable. And most other tasks, such as storytelling, are far too nuanced for RAG. So RAG really isn't a solution for an LLM's broad ignorance.
For example, telling a nuanced, non-contradictory story with apt metaphors, humor, and so on requires that the relevant information be accurately stored in the weights. RAG just copies stories or forces together mismatched pieces of stories, resulting in a Frankenstein mess of a story. Even responses to simple questions often don't flow right with RAG, and sometimes have glaring grammatical and coherency errors as the LLM starts writing sentences that don't line up with the retrieved external information.
However, shipping a database of popular facts alongside the LLM, then using it to populate the knowledge tags mentioned by Thireus, could prove useful. This would force relevant, accurate facts, such as names, time periods, and locations, into the LLM's working memory, helping to keep it from falling off the rails as it performs complex tasks like writing a story. But it would have to be an information-dense relational database. Using an LLM as the source would just needlessly introduce hallucinations and increase compute.
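Something like this rough sketch, say with a small SQLite fact table feeding the knowledge tags. The schema, the example facts, and the tag placement are just illustrative assumptions, not an existing feature of any model.

```python
import sqlite3

# Illustrative only: a tiny fact table whose rows are injected into
# <knowledge></knowledge> before the model starts its own <think> block.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (topic TEXT, fact TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?)", [
    ("Corner Gas", "Corner Gas first aired in 2004."),
    ("Corner Gas", "Brent Leroy is played by Brent Butt."),
    ("Corner Gas", "Lacey Burrows is played by Gabrielle Miller."),
])

def build_prompt(request: str, topic: str) -> str:
    rows = conn.execute("SELECT fact FROM facts WHERE topic = ?", (topic,)).fetchall()
    facts = "\n".join(fact for (fact,) in rows)
    # The facts sit in context ahead of the request, so they're in the model's
    # working memory before it starts reasoning or writing.
    return f"<knowledge>\n{facts}\n</knowledge>\n\n{request}"

print(build_prompt("Write a short scene set at the Corner Gas station.", "Corner Gas"))
```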
That is cool. I haven't messed much with RAG yet, and was thinking it might be a perfect fit for a KB from what I was reading.