Only 9.3 on the English SimpleQA despite 143b total parameters
Edit: I played around with this model a bit, and it has broader knowledge than I expected given its low 9.3 English SimpleQA score.
Still, a model with 143 billion total parameters should score at least 20. Even Mistral Small 24b and Gemma 3 27b score a little higher.
Hey, I'm not from the team, but I have a theory for this result.
First, they trained the model without synthetic data. Second, the team originated in China, and their reddot app is dominated by Chinese users.
From that alone, we can infer that their training data is mostly in Chinese, which would explain the low English SimpleQA score. I don't see this as a negative point.
But of course, I hope they can improve the English knowledge later while continuing to value non-synthetic data, so the model retains the distinct feel that sets it apart from the models we have right now.