Benchmarks please

#20
by Blazgo - opened

I want the benchmarks... they help a lot. Wondering if it's better than Qwen3.

It's not even close. Qwen3, like DeepSeek R1, is good at coding, math, STEM, and a couple of other things like language translation.

However, DeepSeek is good across the board, including at creative writing & poetry, plus has orders of magnitude more broad knowledge. For example, on my English broad knowledge test DeepSeek R1 beats Llama 3.3 70b, while Qwen3 235b barely matches Llama 3.2 3b. This is independently confirmed by their English SimpleQA scores (DeepSeek's 25-30 vs Qwen3 235b's ~10).

Qwen3 235b is a grossly overfit mess. Normally, training on a larger corpus (in this case over 30 trillion tokens) pushes broad knowledge and SimpleQA scores up for a given parameter count. But Alibaba overfit coding, math, and STEM so grossly that their much older and smaller Qwen2 72b has vastly more broad knowledge and a notably higher SimpleQA score (~18).

That's what I was thinking.

What's your "English broad knowledge test " looks like? I don't believe that Qwen3 235b = Llama 3.2 3b

@Alnsven Good question, and you're right to be confused. The English SimpleQA score of Llama 3.2 3b isn't published, but it's no more than ~3-4, so it's much lower than Qwen3 235b's ~10.

However, I specifically designed my test to cover all the popular domains of knowledge that aren't covered by the MMLU and GPQA, including TV shows, music, games, sports, and movies, while SimpleQA includes all major domains of knowledge, including STEM. Consequently, strong STEM models like Qwen3, which have very high MMLU scores (~85 vs Llama 3.2 3b's ~58), get far more of the SimpleQA questions right, hence their higher scores.

DeepSeek is about equally strong at STEM (it has comparable MMLU and GPQA scores), yet scores ~30 vs ~10 on the English SimpleQA with thinking enabled. This is because, unlike Qwen3, DeepSeek isn't grossly overfit to the small subset of popular knowledge covered by the MMLU and GPQA, so it picks up about 20 more points across various domains of knowledge.
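
(If anyone wants to run a similar spot check themselves, here's a minimal sketch of how a test like this can be scored. The endpoint URL, model name, question file, and lenient substring matching are all placeholders for illustration, not my actual harness.)

```python
# Minimal sketch of a broad-knowledge spot check against an OpenAI-compatible
# endpoint (e.g. a local llama.cpp or vLLM server). The URL, model name, file
# format, and lenient substring matching are illustrative assumptions.
import json
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed local server
MODEL = "qwen3-235b"                                    # placeholder model name

def ask(question: str) -> str:
    resp = requests.post(API_URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    })
    return resp.json()["choices"][0]["message"]["content"]

def score(path: str) -> float:
    correct = total = 0
    # Each line of the file: {"question": "...", "answers": ["accepted", "strings"]}
    for line in open(path, encoding="utf-8"):
        item = json.loads(line)
        reply = ask(item["question"]).lower()
        # Lenient check: count it right if any accepted answer appears in the reply.
        if any(ans.lower() in reply for ans in item["answers"]):
            correct += 1
        total += 1
    return 100.0 * correct / total

print(f"score: {score('broad_knowledge.jsonl'):.1f}/100")
```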

Here's an example: "Who portrayed Monica's father in the sitcom Friends?" The quantized Llama 3.2 3b got this right, but the full-float Qwen3 235b commonly gets it wrong (e.g. "Monica's father, Jack Geller, was portrayed by Ron Leibman..." rather than Elliott Gould). Another example is "Who sang the modern hit song Dear Future Husband? What album is it from? And what year was it released?", which again Llama 3.2 3b got right, but Qwen3 235b returned "The modern hit song "Dear Future Husband" was sung by Tori Kelly. It is from her debut studio album "Unbreakable Smile", which was released on May 22, 2015" instead of Meghan Trainor (from the album Title, released in 2015).

If anything I was being too generous. Qwen3 235b, while far more knowledgeable about STEM (and to a lesser degree general academia) than Llama 3.2 3b, is notably less knowledgeable about all other popular domains of knowledge. So again I reiterate: Qwen3 is a grossly overfit model.

@phil111 Qwen3 hallucinates general STEM knowledge as well; I don't really use it anymore.

Is this new checkpoint of R1 still good for general knowledge? I haven't had a chance to test it yet.

Side note since you seem to be across SimpleQA: Do you know if anyone has run it on command-a? If so, could you link me?

@gghfez I also noticed odd hallucinations with Qwen3 when it comes to very basic STEM knowledge. But when I directly ask about the hallucinated info with perfect spelling, grammar etc., it proves to be very knowledgeable about STEM.

My theory is this is caused by Alibaba filtering out too much redundancy from the corpus (e.g. "The earth is the third rock from the sun." and "The earth is the third rocky planet from the sun." and so on). So when you ask more colloquially, e.g. "What is the third rock from the sun?", it struggles. You apparently need to ask in the proper way.
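
(Rough illustration of what I mean, purely hypothetical: if a near-duplicate filter is tuned too aggressively, paraphrases that carry the same fact in different wordings get dropped, so the model only ever sees one phrasing. The normalization and threshold below are made up; I obviously don't know Alibaba's actual pipeline.)

```python
# Toy near-duplicate filter: aggressively drops paraphrases of the same fact.
# Purely illustrative; not Alibaba's actual data pipeline.
import re
from difflib import SequenceMatcher

corpus = [
    "The earth is the third rock from the sun.",
    "The earth is the third rocky planet from the sun.",
    "Earth is the third planet orbiting the Sun.",
]

def normalize(s: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", s.lower())

kept = []
for sentence in corpus:
    # Drop a sentence if it is ~85%+ similar to anything already kept.
    if all(SequenceMatcher(None, normalize(sentence), normalize(k)).ratio() < 0.85
           for k in kept):
        kept.append(sentence)

print(kept)  # the close paraphrase is filtered out, so the fact survives in fewer wordings
```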

And yes, I spot-checked this new R1 with thinking enabled, but with only 2 tricky prompts most LLMs get wrong. It got them both right, so its broad knowledge is likely the same as the earlier R1's, and a little higher than Llama 3.1 405b's.

In regard to Command-a, I never tested it or saw its SimpleQA score. I did, however, test the much smaller Command-r 34b 082024, and it scored a solid 75.4/100 on my test, which was much higher than the previous version of Command-r 34b, so it has an OK amount of broad knowledge. For comparison, Gemma 3 27b scored 74.6/100 and Mistral Small 2409 22b scored 82.6 (the leader among ≤34b models). But newer versions of Mistral Small started overfitting math, coding, STEM and media, hence dropped to 75.4 (v2501), then 74.6 (v2503).

@gghfez Out of curiosity I ran several hard prompts on Command-a and it got them all right. I'm very impressed with it.

For example, "What are the 6 main characters, and the actors who portrayed them, on the TV show Corner Gas? Don't add details, just list them. And what year did the show first air?"

"
Brent Leroy (Brent Butt)
Wanda Dollard (Nancy Robertson)
Hank Yarbo (Fred Ewanuick)
Lacey Burrows (Gabrielle Miller)
Karen Pelly (Tara Spencer-Nairn)
Oscar Leroy (Eric Peterson)

The show first aired in 2004."

It even got the hard-to-parse question "Which two actresses played the two ex-wives of Alan Harper in Two and a Half Men?" right, after correcting itself:

" Holland Taylor as Evelyn Harper (Alan's mother, not an ex-wife, but a notable character in the show).

Note: There seems to be a mix-up in the question. Alan Harper's ex-wives in the show are:
Judith Harper-Melnick, played by Marin Hinkle.
Kandi Harper, played by April Bowlby."

perfect spelling, grammar etc

Yes, God forbid I rush a prompt / have a typo, and Qwen wastes 2k tokens fixating on it lol.

Out of curiosity I ran several hard prompts on Command-a and it got them all right. I'm very impressed with it.

Yeah, it's an underrated model imo. Glad I brought it to your attention.
I've found it's able to stay on track at longer contexts better than the command-r/r+ models.

higher than the previous version of Command-r 34b

At shorter contexts, for sure, but at >16k tokens the previous one seemed better in my experience. I suspect it's the trade-off for GQA in the newer one.
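
(For anyone unfamiliar: GQA shares one key/value head across a group of query heads, which shrinks the KV cache and helps at long contexts, and that sharing is the quality trade-off I'm suspecting. A rough sketch of the idea, with arbitrary shapes, not Cohere's actual implementation:)

```python
# Rough sketch of grouped-query attention (GQA): groups of query heads share a
# single key/value head, shrinking the KV cache. Illustrative only.
import torch

def gqa_attention(q, k, v):
    # q: (batch, n_q_heads, seq, dim); k, v: (batch, n_kv_heads, seq, dim)
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)  # each KV head now serves `group` query heads
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 16, 64)   # 8 query heads
k = torch.randn(1, 2, 16, 64)   # only 2 KV heads need to be cached
v = torch.randn(1, 2, 16, 64)
print(gqa_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```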

I just tested this new R1 a bit, and it's an improvement over the OG for reasoning about fiction (sequence of events / timeline in complex stories). SOTA for this IMO.
It also called out a typo, but didn't fixate on it too much. And with correct spelling, it's able to answer tricky questions with fewer reasoning tokens than the OG R1!
I hope this new R1 quantizes well 🤞

@phil111 Bad news: R1 0528 has a lower SimpleQA score, though not by much.

@CHNtentes Thanks for letting me know the score was posted.

Overtraining a model reliably scrambles the weights a bit, leading to an increase in factual hallucinations, hence the SimpleQA score drop from 30.1 to 27.8. That's still slightly higher than Llama 3.1 405b, so it's likely still the open-source broad knowledge king. But GPT4o & Gemini 2.5 still have far more knowledge than DeepSeek (SimpleQA up to 53, and GPT4.5 is at 63). And the nuanced understanding these large proprietary models have over open-source models is very pronounced. It almost feels like you're interacting with a human when using models like Gemini 2.5 & GPT4.5.
