Qwen3 is great, but could be better.

#18
by phil111 - opened

Firstly, the Qwen3 family is notably better than other similarly sized models when it comes to coding, math, reasoning, general STEM knowledge, and several other tasks.

However, it has some very notable weaknesses, such as poem writing. But most notably, it has virtually no popular knowledge, such as music, movies, and the rest of pop culture.

Yes, for simple pop culture question answering, traditional RAG and web searching work great, but they help no more with tasks like writing original poems, jokes, and apt metaphors in stories, and other creative tasks, than with solving math or coding problems. If anything, the former are more complex and nuanced. Yet the entire AI industry is becoming obsessed with coding and STEM, which isn't surprising since it's mostly comprised of coders, and the same goes for the early adopters of open source models.

Anyways, since popular knowledge is random, extensive, and hard to accurately store in LLM weights, such as the song lists from albums and the casts of shows linked to their respective actors, why not fuse the weights with a small database containing only core facts about humanity's most popular information, then populate the thinking tag with the relevant information?
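To make the idea concrete, here's a rough sketch of what such a core-facts store could look like. All names and the schema are made up; the point is just that it's bare subject-fact records, not prose (the example facts are from the Corner Gas test I describe below):

```python
# Minimal sketch of a hypothetical "core facts" store: compact
# subject -> fact records with no sentence structure to get wrong.
import sqlite3

conn = sqlite3.connect("core_facts.db")  # hypothetical database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS core_facts (
        subject   TEXT,   -- e.g. 'Corner Gas'
        category  TEXT,   -- e.g. 'tv_show'
        fact_type TEXT,   -- e.g. 'main_cast', 'setting'
        fact      TEXT    -- the bare fact, stripped of ambiguity
    )
""")
conn.executemany(
    "INSERT INTO core_facts VALUES (?, ?, ?, ?)",
    [
        ("Corner Gas", "tv_show", "main_cast",
         "Brent Leroy - played by Brent Butt, owns the gas station"),
        ("Corner Gas", "tv_show", "main_cast",
         "Lacey Burrows - played by Gabrielle Miller, owns The Ruby diner"),
        ("Corner Gas", "tv_show", "setting",
         "Dog River, Saskatchewan"),
    ],
)
conn.commit()
```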

I tested this by manually populating the thinking tags with core information. For example, after it got the entire cast of a show (Corner Gas) wrong, I gave it the correct cast, and, as with humans, this core knowledge dredged up related knowledge: it subsequently said the gas station and diner were owned by two different characters rather than the same character.

So if it works manually, why couldn't an LLM automatically populate the thinking tags with core information from a fused database (e.g. main cast and plot) whenever the user's prompt references a show, movie, singer, game character, sporting event, or other pop culture subject? That would not only jog its memory, but also keep it from hallucinating said core facts, including the main characters of movies seen by countless millions of people.
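Something along these lines, building on the hypothetical `core_facts` table sketched above. The function name is made up, the subject matching is deliberately naive, and exactly how the prefill gets attached to the assistant turn would depend on the chat template:

```python
# Hypothetical sketch of the proposed pipeline: match known subjects
# against the user's prompt, then pre-seed the model's thinking block
# with the retrieved core facts before generation starts.
def build_think_prefill(prompt: str, conn) -> str:
    subjects = [row[0] for row in conn.execute(
        "SELECT DISTINCT subject FROM core_facts")]
    # Naive substring match; a real system would need proper entity linking.
    hits = [s for s in subjects if s.lower() in prompt.lower()]
    if not hits:
        return ""  # nothing recognized, let the model think on its own
    lines = []
    for s in hits:
        for fact_type, fact in conn.execute(
                "SELECT fact_type, fact FROM core_facts WHERE subject = ?",
                (s,)):
            lines.append(f"{s} | {fact_type}: {fact}")
    # The returned text would be prepended to the assistant turn as a
    # prefill, so the facts sit in the model's working memory.
    return "<think>\nRelevant core facts:\n" + "\n".join(lines) + "\n"
```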

Also, it seems to me that since it would be a simple database of facts stripped of sentence structure and other ambiguity, the knowledge would prove beneficial across all supported languages.

Anyways, Qwen3 is great, but it's way out of balance and unusable as a general purpose AI model for the general non-coder population. Ending training with more creative and pop culture tokens, like poetry and humor, will almost certainly lower its coding, math, and STEM test scores a bit, but I strongly feel it's necessary. Plus, pulling the core facts about the subject referenced by the user's prompt out of an included database and putting them into the working memory of the LLM should drastically reduce the hallucination rate, which frankly is out of control. When it comes to numerous popular domains of knowledge, Qwen3 30 & 32b are only on par with 1b models.

Man, I'm not being critical or anything, but I've seen you many times, and your posts are all about a new model knowing little about pop culture...

Yeah, he keeps running the same set of his own test questions on new models. It's not about failing this specific test, though. It's about how these new models share a similar shift in behavior, which makes these same specific test questions fail, and which comes with the disadvantages already mentioned above.

I know it gets repetitive, but when you think you have something important to say, there isn't a better option than to repeat it again and again... He's already given various examples before, so he's not wrong; you just have to decide for yourself whether you think that information is necessary for making a good language model. I personally agree.

@CHNtentes I totally get it. My repetitiveness is probably annoying me more than anyone else. But like nplguy said, what choice do I have? The entire OS AI industry shifted towards maximizing coding and math performance, including Meta with Llama 4.

And sure, this comes with advantages. Qwen3 30B-A3B did better on my coding, STEM, reasoning, and math questions than any OS model I've tested, including Gemma 3 27b, Mistral Small, and Llama 3.1 70b. Plus the instruction following is the best yet, and the story writing is OK, though worse than Mistral Small's and Gemma 3 27b's.

However, it performs astonishingly badly at a large number of things the general population cares most about. For example, it's called pop culture because it's POPULAR. Plus, performance on common use cases like poem writing is abysmal. It can write OK poems that it regurgitates, but when prompted for original poems, they lack depth and eloquence, don't even begin to respect meter, and about half the lines break the rhyme scheme.

The primary reason I feel compelled to make these points over and over is that I want what's best for the open source AI community, including broad adoption by the general population, and these lopsided models, with >10x the STEM performance of their pop culture and creative performance, won't allow that to happen. They need more balanced training. Plus, the hallucinations about very popular things (e.g. the most popular movies, shows, songs, games, celebrities...) need to be reduced by a couple orders of magnitude, and not with simple RAG and web access, which only really help with simple question answering; hence the need for a dense database containing only core facts (e.g. main cast and plot) to keep LLMs from constantly falling off the rails and vomiting hallucinations about very popular things.

"culture" is for the weak. Both humans and AI have work to get done.

I totally agree you have the right to evaluate models by your own approach, but it seems the reality now is that most model providers and users emphasize STEM benchmark scores, especially coding ability. Those are easier to compare and advertise.

I think I found another weak point... OK, so I gave it a not-too-long story, maybe 4,000 tokens, and asked it for a rating. It answered something like 9.5/10... great story... blah blah blah, and pointed at what could be improved, this and that... So I told it, "Can you write this story better?", and it answered, "Certainly! Here's a revised version of your story..." Then it told me what was changed, but when I looked at it, it was exactly the same story, lol. I couldn't find any changes.
Even funnier, I gave this "new" story to ChatGPT 4o after also asking it for a rating of the "original" one. And I asked ChatGPT 4o if the "new" version of the story was better. It said, YES, definitely, and started to list what was "improved". But those "improved" parts were already there...
I don't even know what to call this. Hallucinations?
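I ended up eyeballing it, but something like this would catch it automatically. Just a rough sketch; the helper name and threshold are made up:

```python
# Quick check for whether a "revised" story actually differs from the
# original, using Python's standard difflib.
import difflib

def is_effectively_unchanged(original: str, revised: str,
                             threshold: float = 0.98) -> bool:
    """Return True when the 'revision' is near-verbatim the original."""
    ratio = difflib.SequenceMatcher(
        None, original.split(), revised.split()).ratio()
    return ratio >= threshold  # ratio of 1.0 = repeated word for word
```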

@urtuuuu Yep, I noticed this same issue across most LLMs. For example, one of my prompts provides a limerick that doesn't rhyme and asks the model to make it rhyme while preserving its meaning, yet most LLMs just repeat it back word for word, or make a couple of minor changes, and then claim success. Even the dumbest humans who have ever lived would never do such a thing.

Llama 3.1 8 & 70b did a good job, as did Qwen2 72b, but Qwen2.5 did poorly. And Gemma 2/3 only made minor mistakes, while Mistral Small 2402 did poorly but at least tried, and its successor Mistral Small 2501 just repeated it back word for word. So there's a pattern of regression on such tasks as models were overfit with coding and math.

This is because none of these models are "thinking"; they haven't gained a single generalized IQ point from being trained on mountains of coding, math, and logic tokens. They need to be trained on a diverse set of instructions.
