That SimpleQA score looks too good to be true.
Update: I tested this model at novita. There's either something profoundly wrong with the hosting of this model or the SimpleQA score is only ~3 and the 24.2 score is due to contamination. See comment below for a couple sample prompts you can test yourself.
Can't wait to test this. An English SimpleQA score of 24.2 with only 21b total parameters is jaw-dropping.
No other proprietary or OS model comes close to matching that score-to-size ratio. For example, GPT4 mini, Mistral 24b and Gemma 3 27b all only score ~10-12, and Llama 3 70b, the current 70b- leader, only scores ~20.
I know the posted Qwen3 scores are accurate. Qwen3 30b was profoundly ignorant across most popular domains, so its 4.7 score is near where I predicted. Same goes for Qwen3 235b's score of 11. But again, a score of 24.2 for a 21b model is incredible by comparison. The test has been out for a while so I hope accidental contamination isn't responsible for too much of that score.
My first test of a new LLM is a collection of esoteric medical knowledge questions, which I find correlates really well with SimpleQA. This slightly edges out Llama 3.3 70B (I tried it on https://novita.ai/). I'm genuinely blown away. Now, it falls short on medical reasoning, but that can be post-trained.
I tried this model out at https://novita.ai and its broad knowledge is even worse than Qwen3 30b's. Unless something is configured wrong the SimpleQA score of this model is only ~3, not 24.2. Can someone try the following prompts and report back?
- "Who played Robert Barone’s wife on the TV show Everybody Loves Raymond?"
This returned "Kate Sutton played Robert Barone's wife, Debra Barone..." which is all kinds of wrong. This is a very popular show, Debra is the wife of the main character Ray Barone, not Robert, and Kate Sutton is way off.
- "List the 6 main characters, and the actors who played them, from Two and a Half Men. Don't add details, just list them. And what year did the show first air?"
Charlie Harper - Charlie Sheen
Alan Harper - Jon Cryer
Jake Harper - Ashton Kutcher
Judith Harper - Conchata Ferrell
Evelyn Harper - Holland Taylor
Chelsea Hancher - Angus T. Jones
This is one of the most-watched TV shows in history, it ran for 12 years, and even very small models such as Llama 3.1 3b get this right. Yet this model messed up the main cast. It claimed that Jake Harper, the boy (the "half" in Two and a Half Men), was portrayed by Ashton Kutcher, who replaced the main character; that Jake's mother Judith was played by Conchata Ferrell, who played the housekeeper; and that Chelsea, a love interest of Charlie's, was played by Angus T. Jones, who actually played the boy mentioned above (and is male).
This model is getting every question wrong that even Qwen3 30b got right. However, since it got the top 100 most popular movies, shows, games, music... right, such as the TV show Friends, it doesn't appear to be a configuration issue at novita. So assuming there's nothing wrong with the hosting, this model's SimpleQA score is due to contamination. Its true SimpleQA score is ~3.
Here are a few more examples. It's even messing up more academic, less pop-culture questions that models scoring >5 on SimpleQA reliably get right, and that models scoring >10 always get right. I even tried pulling the temperature down to 0.3.
- What book is the following line from "it was the age of wisdom, it was the age of foolishness,"?
The line "it was the age of wisdom, it was the age of foolishness" is from the opening lines of the poem "The Tyger" by William Blake.
- What did Kanye West change his name to?
Yeezus
This was one of his album names, but his name change was to Ye, and more recently Ye Ye.
- Who played the comic store owner on the TV show The Big Bang Theory?
"The comic book store owner on the series The Big Bang Theory is portrayed by John Ross Bowie. He played the recurring character Barry Kripke, a Ph.D. student..."
That answer is somewhat close since Barry Kripke is on the show, but this model seems to just roll the dice.
Anyways, I've seen enough. Assuming the correct model is being hosted at https://novita.ai, and there's nothing profoundly wrong with the configuration, this model's true SimpleQA score is only ~3, and no more than 5. Models that score 20 or higher reliably get 100% of my easy set of questions right, yet this model got most of them wrong, plus it makes inexcusable mistakes, such as returning male actors for female characters. There's no way its SimpleQA score is >5, let alone 24.2. A score of 24.2 would put it on par with Llama 3 405b and DeepSeek v3, which have orders of magnitude more knowledge than models with a SimpleQA score of ~5.
This is the answer I obtained using the official API, temp 0.3:
USER
Who played Robert Barone’s wife on the TV show Everybody Loves Raymond?
ASSISTANT
In the TV show Everybody Loves Raymond, Robert Barone's wife is played by Doris Roberts. She portrayed Marie Barone, Robert's mother's sister-in-law and the matriarch of the Barone family. Doris Roberts was a well - known actress and won multiple Emmy Awards for her role as Marie.
USER
List the 6 main characters, and the actors who played them, from Two and a Half Men. Don't add details, just list them. And what year did the show first air?
ASSISTANT
The six main characters and their actors from Two and a Half Men are:
- Charlie Harper - Charlie Sheen
- Alan Harper - Jon Cryer
- Judith Harper - Holland Taylor
- Walden Schmidt - Ashton Kutcher
- Berta - Conchata Ferrell
- Jake Harper - Angus T. Jones
The show first aired in 2003.
USER
What book is the following line from "it was the age of wisdom, it was the age of foolishness,"?
ASSISTANT
The line "it was the age of wisdom, it was the age of foolishness" is from the poem "In Memoriam A.H.H." by Alfred, Lord Tennyson. This poem was written as a memorial to Tennyson's close friend Arthur Henry Hallam, who died suddenly in 1833.
USER
What did Kanye West change his name to?
ASSISTANT
Kanye West was born Kanye Omari West. He legally changed his name to simply Kanye West in 2009.
USER
Who played the comic store owner on the TV show The Big Bang Theory?
ASSISTANT
The comic store owner on The Big Bang Theory is played by John Ross Bowie. He portrayed Barry Kripke, a character who is a physicist and a recurring figure on the show.
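For anyone who wants to repeat this kind of spot check on other replies, here is a minimal Python sketch. It is not the official SimpleQA grader, just a whole-phrase match against reference answers. The reference answers (Monica Horan, A Tale of Two Cities, Ye, Kevin Sussman) are filled in from general knowledge rather than from this thread, and the names `CASES`, `is_correct`, and `grade` are made up for illustration. The replies are the first sentences of the official-API answers quoted above; the multi-part Two and a Half Men question is left out because it needs more than a single phrase match.

```python
import re

# (label, reply quoted from the thread, accepted reference answers).
# Reference answers are an assumption supplied for illustration.
CASES = [
    ("Robert Barone's wife",
     "In the TV show Everybody Loves Raymond, Robert Barone's wife "
     "is played by Doris Roberts.",
     ["Monica Horan"]),
    ("Age of wisdom line",
     'The line "it was the age of wisdom, it was the age of foolishness" '
     'is from the poem "In Memoriam A.H.H." by Alfred, Lord Tennyson.',
     ["A Tale of Two Cities"]),
    ("Kanye West name change",
     "Kanye West was born Kanye Omari West. He legally changed his name "
     "to simply Kanye West in 2009.",
     ["Ye"]),
    ("Big Bang Theory comic store owner",
     "The comic store owner on The Big Bang Theory is played by "
     "John Ross Bowie.",
     ["Kevin Sussman"]),
]


def is_correct(reply, accepted):
    """True if any accepted answer appears in the reply as a whole phrase."""
    return any(
        re.search(r"\b" + re.escape(answer) + r"\b", reply, re.IGNORECASE)
        for answer in accepted
    )


def grade(cases):
    """Return per-question results and the fraction answered correctly."""
    results = {label: is_correct(reply, accepted)
               for label, reply, accepted in cases}
    accuracy = sum(results.values()) / len(results)
    return results, accuracy


results, accuracy = grade(CASES)
for label, ok in results.items():
    print(f"{label}: {'correct' if ok else 'wrong'}")
print(f"accuracy: {accuracy:.0%}")
```

The word-boundary match matters for short answers: "Ye" should not be credited just because the reply contains "Kanye" or "Yeezus". A handful of questions like this only gives a rough signal, not a SimpleQA score.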
@gghfez Thanks for verifying.
And yes, in my testing not only does this model only have the knowledge of others with a SimpleQA score of ~3, its outputs were far less stable than any model I've tested, which is why I suspected something might be configured wrong.
For example, instead of answering the book question mentioned above it first tried continuing the line, which is odd even for a base model. And with another, despite it being a long English prompt, it responded entirely in Chinese.
@snnh Thanks for testing my prompts, and using temp 0.3 which helps mitigate factual hallucinations. Based on those responses this model's English SimpleQA score is, if anything, lower than Qwen3 30b's, and certainly nowhere near 20. I'm >99% sure this model's corpus was contaminated with the SimpleQA test.
Smaller models may not retain much knowledge of other languages, so you should try ERNIE-4.5-300B-A47B.
@snnh True, smaller models can't retain much knowledge.
However, Baidu claimed that Qwen3 30b only had a SimpleQA score of 4.7, while their Ernie 4.5 20b has a SimpleQA score of 24.2, when in fact it's lower than Qwen3's at only ~3.
Additionally, their blog even brags about it, and the base model has an even higher score, so it's not a typo ("...demonstrates significant strengths in instruction following and knowledge tasks, as evidenced by the state-of-the-art scores on benchmarks such as IFEval, Multi-IF, SimpleQA...").
It's impossible to conceive of a group of people capable of making an AI model not knowing that a SimpleQA score of 24.2 with a 20b model is impossibly high, so why did they not only post it, but brag about it, especially when the true SimpleQA score is only ~3?
As a Chinese person, although I really don't want to admit this, it is common for Chinese manufacturers to conduct special optimizations for benchmarks.
I can submit a ticket to inquire about the situation regarding the SimpleQA score. Do you need my help?
@snnh Thanks, but I just wanted to bring attention to the fact that the SimpleQA score is way off. Hopefully in the future Baidu will take better care removing test contamination from its corpus. Plus I'm hoping they'll remove the SimpleQA scores, and references to them, from their write-ups.