Only 9.3 on the English SimpleQA despite 143b total parameters
Edit: I played around with this model a bit, and it has broader knowledge than I expected given its low 9.3 English SimpleQA score.
Still, a model with 143 billion total parameters should score at least 20. Even Mistral Small 24b and Gemma 3 27b score a little higher.
Hey, I'm not from the team, but I have a theory for this result.
First, they trained the model without synthetic data. Second, the team originated in China, and their reddot app is dominated by Chinese users.
From that alone, we can infer that their training data is mostly in Chinese, which would explain the low English SimpleQA score. I don't see this as a negative point.
But of course, I hope they can improve the English knowledge later while continuing to value non-synthetic data, so the model retains the distinct feel that sets it apart from the models we have right now.