baidu/ERNIE-4.5-21B-A3B-PT · That SimpleQA score looks too good to be true.

28 days ago

•

Update: I tested this model at novita. There's either something profoundly wrong with the hosting of this model or the SimpleQA score is only ~3 and the 24.2 score is due to contamination. See comment below for a couple sample prompts you can test yourself.

Can't wait to test this. An English SimpleQA score of 24.2 with only 21b total parameters is jaw dropping.

No other proprietary or OS model comes close to matching that score to size ratio. For example, GPT4 mini, Mistral 24b and Gemma 3 27b all only score ~10-12, and Llama 3 70b, the current 70b- leader, only scores ~20.

I know the posted Qwen3 scores are accurate. Qwen3 30b was profoundly ignorant across most popular domains, so its 4.7 score is near where I predicted. Same goes for Qwen3 235b's score of 11. But again, a score of 24.2 for a 21b model is incredible by comparison. The test has been out for a while so I hope accidental contamination isn't responsible for too much of that score.

cmcmaster

28 days ago

My first test of a new LLM is on a collection of esoteric medical knowledge questions, which I find correlates really well with SimpleQA. This is slightly edges Llama 3.3 70B level (I tried on https://novita.ai/). I'm genuinely blown away. Now, it falls short on medical reasoning, but that can be post-trained.

phil111

28 days ago

I tried this model out at https://novita.ai and its broad knowledge is even worse than Qwen3 30b's. Unless something is configured wrong the SimpleQA score of this model is only ~3, not 24.2. Can someone try the following prompts and report back?

"Who played Robert Barone’s wife on the TV show Everybody Loves Raymond?"

This returned "Kate Sutton played Robert Barone's wife, Debra Barone..." which is all kinds of wrong. This is a very popular show at Debra is the main character Ray Barone's wife, and Kate Sutton is way off.

"List the 6 main characters, and the actors who played them, from Two and a Half Men. Don't add details, just list them. And what year did the show first air?"

Charlie Harper - Charlie Sheen
Alan Harper - Jon Cryer
Jake Harper - Ashton Kutcher
Judith Harper - Conchata Ferrell
Evelyn Harper - Holland Taylor
Chelsea Hancher - Angus T. Jones

This is one of the most watched TV shows in history and it ran for 12 years and even very small models get this right, such as Llama 3.1 3b, yet it messed up with the main cast, such as claiming Jake Harper the boy (the half in two and a half men) was portrayed by the main character's replacement Ashton Kutcher, and Jakes mother Judith played by the house keeper Conchata, and Chelse, a lover interest of Charlie's as Angus T. Jones, which is actually the boy mentioned above (and a male).

This model is getting every question wrong that even Qwen3 30b got right. However, since it got the top 100 most popular movies, shows, games, music... right, such as the TV show Friends, it doesn't appear to be a configuration issue at novita. So assuming there's nothing wrong with the hosting this model's SimpleQA score is due to contamination. It's true SimpleQA score is ~3.

phil111

28 days ago

•

edited 28 days ago

Here's a few more examples. It's even messing up with more academic, and less pop culture, questions that models scoring >5 of SimpleQA reliably get right. And models that score >10 always get right. I even tried pulling the temperature down to 0.3.

What book is the following line from "it was the age of wisdom, it was the age of foolishness,"?

The line "it was the age of wisdom, it was the age of foolishness" is from the opening lines of the poem "The Tyger" by William Blake.

What did Kanye West change his name to?

Yeezus

This was one of his Album names, but his name change was to Ye, and more recently Ye Ye.

Who played the comic store owner on the TV show The Big Bang Theory?

"The comic book store owner on the series The Big Bang Theory is portrayed by John Ross Bowie. He played the recurring character Barry Kripke, a Ph.D. student..."

That answer is somewhat close since Barry Kripke is on the show, but this model seems to just roll the dice.

Anyways, I've seen enough. Assuming the correct model is being hosted at https://novita.ai, and there's nothing profoundly wrong with the configuration, this model's true SimpleQA score is only ~3, and no more than 5. Models that score 20 or higher reliably get 100% of my easy set of questions right, yet this model not got most of them wrong, plus it makes inexcusable mistakes, such as returning male actors for female characters. There's no way it's SimpleQA score is >5, let alone 24.2. 24.2 would put it on par with Llama 3 405b and DeepSeek v3 which have orders of magnitude more knowledge that models with a SimpleQA score of ~5.

gghfez

28 days ago

Yeah it's failed all my "secret test questions" (transformers version, full precision) as well. And it's writing is weird + looks like it's been trained on Deepseek outputs.

I don't know anything about that series but there's it's response to your prompt using transformers.

snnh

28 days ago

This is the answer I obtained using the official API,temp:0.3:

USER

Who played Robert Barone’s wife on the TV show Everybody Loves Raymond?

ASSISTANT

In the TV show Everybody Loves Raymond, Robert Barone's wife is played by Doris Roberts. She portrayed Marie Barone, Robert's mother's sister-in-law and the matriarch of the Barone family. Doris Roberts was a well - known actress and won multiple Emmy Awards for her role as Marie.

USER

List the 6 main characters, and the actors who played them, from Two and a Half Men. Don't add details, just list them. And what year did the show first air?

ASSISTANT

The six main characters and their actors from Two and a Half Men are:

Charlie Harper - Charlie Sheen
Alan Harper - Jon Cryer
Judith Harper - Holland Taylor
Walden Schmidt - Ashton Kutcher
Berta - Conchata Ferrell
Jake Harper - Angus T. Jones

The show first aired in 2003.

USER

What book is the following line from "it was the age of wisdom, it was the age of foolishness,"?

ASSISTANT

The line "it was the age of wisdom, it was the age of foolishness" is from the poem "In Memoriam A.H.H." by Alfred, Lord Tennyson. This poem was written as a memorial to Tennyson's close friend Arthur Henry Hallam, who died suddenly in 1833.

USER

What did Kanye West change his name to?

ASSISTANT

Kanye West was born Kanye Omari West. He legally changed his name to simply Kanye West in 2009.

USER

Who played the comic store owner on the TV show The Big Bang Theory?

ASSISTANT

The comic store owner on The Big Bang Theory is played by John Ross Bowie. He portrayed Barry Kripke, a character who is a physicist and a recurring figure on the show.

phil111

28 days ago

@gghfez Thanks for verifying.

And yes, in my testing not only does this model only have the knowledge of others with a SimpleQA score of ~3, its outputs were far less stable than any model I've tested, which is why I suspected something might be configured wrong.

For example, instead of answering the book question mentioned above it first tried continuing the line, which is odd even for a base model. And with another, despite it being a long English prompt, it responded entirely in Chinese.

phil111

28 days ago

@snnh Thanks for testing my prompts, and using temp 0.3 which helps mitigate factual hallucinations. Based on those responses this model's English SimpleQA score is, if anything, lower than Qwen3 30b's, and certainly nowhere near 20. I'm >99% sure this model's corpus was contaminated with the SimpleQA test.

snnh

28 days ago

This comment has been hidden (marked as Off-Topic)

snnh

28 days ago

Smaller models may not retain much knowledge of other languages, so you should try ERNIE-4.5-300B-A47B.

phil111

28 days ago

@snnh True, smaller models can't retain much knowledge.

However, Baidu claimed that Qwen3 30b only had a SimpleQA score of 4.7, while their Ernie 4.5 20b has a SimpleQA score of 24.2, when in fact it's lower than Qwen3's at only ~3.

Additionally, their blog even brags about it, and the base model has an even higher score, so it's not a typo (...demonstrates significant strengths in instruction following and knowledge tasks, as evidenced by the state-of-the-art scores on benchmarks such as IFEval, Multi-IF, SimpleQA...").

It's impossible to conceive of a group of people capable of making an AI model not knowing that a SimpleQA score of 24.2 with a 20b model is impossibly high, so why did they not only post it, but brag about it, especially when the true SimpleQA score is only ~3?

snnh

28 days ago

As a Chinese, although I really don't want to admit this, it is common for Chinese manufacturers to conduct special optimizations for benchmarks

snnh

28 days ago

I can submit a ticket to inquire about the situation regarding the Simple QA score. Do you need my help?

phil111

28 days ago

@snnh Thanks, but I just wanted to bring attention to the fact that the SimpleQA score is way off. Hopefully in the future Baidu will take better care removing test contamination from its corpus. Plus I'm hoping they'll remove the SimpleQA scores, and references to them, from their write-ups.

gghfez

27 days ago

As a Chinese, although I really don't want to admit this, it is common for Chinese manufacturers to conduct special optimizations for benchmarks

I think that happens everywhere mate, not just China (see llama-4). And in this case it could just be a mistake (this is their first LLM release right?).
Let's not forget you guys produced the best open-weights LLM (DeepSeek-R1) and multi-modal (Bagel) :)

True, smaller models can't retain much knowledge.

The larger one is on Open Router and seems a lot better, and not as SLOP-y! It's getting hammered atm though so will have to test it more later.

snnh

27 days ago

As a Chinese, although I really don't want to admit this, it is common for Chinese manufacturers to conduct special optimizations for benchmarks

I think that happens everywhere mate, not just China (see llama-4). And in this case it could just be a mistake (this is their first LLM release right?).
Let's not forget you guys produced the best open-weights LLM (DeepSeek-R1) and multi-modal (Bagel) :)

True, smaller models can't retain much knowledge.

The larger one is on Open Router and seems a lot better, and not as SLOP-y! It's getting hammered atm though so will have to test it more later.

Thank you. The larger one can be called on Baidu Qianfan, and currently, it responds quite fast. If you have any questions you need to test, I can help

mrwkd123

27 days ago

This comment has been hidden (marked as Off-Topic)

mrwkd123

27 days ago

This comment has been hidden (marked as Abuse)

mrwkd123

27 days ago

As a Chinese, although I really don't want to admit this, it is common for Chinese manufacturers to conduct special optimizations for benchmarks

。。。。。。