etemiz posted an update 5 days ago
It looks like the Llama 4 team gamed the LMArena benchmarks by making their Maverick model output emojis, longer responses, and ultra-high enthusiasm! Is that ethical or not? They could certainly do a better job by working with teams like llama.cpp before releasing the model, just like the Qwen team did with Qwen 3.

In 2024 I started playing with LLMs, just before the release of Llama 3. I think Meta has contributed a lot to this field and is still contributing. Most LLM fine-tuning tools are based on their models, and the inference tool llama.cpp even carries their name. Llama 4 is fast and maybe not the greatest in real performance, but it still deserves respect. But my enthusiasm for Llama models is probably because they rank highest on my AHA Leaderboard:

https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08

Looks like they did a worse job compared to Llama 3.1 this time. Llama 3.1 has been on top for a while.

Ranking high on my leaderboard is not correlated with technological progress or parameter size. In fact, if LLM training is drifting away from human alignment thanks to synthetic datasets or something else (?), it could easily be inversely correlated with technological progress. There does seem to be a correlation with the location of the builders (West or East): Western models rank higher. This has become more visible as the leaderboard has progressed; in the past there was less correlation. And European models seem to land in the middle!

Whether you like positive vibes from AI or not, maybe we are getting closer to a time when humans can be gamed by an AI? What do you think?

It's long been my view that LMArena isn't a fully reliable measure of real-world LLM performance. I suspect many users might click somewhat randomly, perhaps favoring answers based on superficial qualities like length, formatting, or speed, rather than deeper assessment.

Since all the Arena dialogues are publicly available on Hugging Face, a crowdsourced evaluation system utilizing that data seems like it could be quite valuable. It would also be interesting to see more development in automated evaluation systems, perhaps along the lines of "Arena-Hard-Auto" (though keeping such systems updated and robust is a challenge). However, building an effective automated evaluator would likely require training a specialized model on a large corpus, because I'm fairly certain that using a current powerful model like GPT-4-Turbo (or any other) for evaluation would introduce bias, favoring responses that align with its own style.
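Since the battle data is public, one concrete form such a crowdsourced re-evaluation could take is simply recomputing Elo-style (Bradley-Terry-like) ratings independently from the raw pairwise votes. Below is a minimal sketch assuming the battles have been flattened into (model_a, model_b, winner) records; the model names, the record format, and the K-factor are illustrative placeholders, not the actual LMArena schema or pipeline.

```python
# Minimal sketch: recompute Elo-style ratings from pairwise "battle" records.
# The (model_a, model_b, winner) format and the example rows are assumptions
# for illustration, not the actual schema of the published Arena data.
from collections import defaultdict

K = 4            # small update step, suitable when there are many battles
BASE = 1000.0    # starting rating for every model

battles = [
    # hypothetical examples; a real run would load the public dataset instead
    ("llama-4-maverick", "llama-3.1-405b", "model_a"),
    ("llama-3.1-405b", "qwen-3-235b", "model_b"),
    ("qwen-3-235b", "llama-4-maverick", "tie"),
]

ratings = defaultdict(lambda: BASE)

for model_a, model_b, winner in battles:
    ra, rb = ratings[model_a], ratings[model_b]
    # expected score of model_a under the Elo / Bradley-Terry model
    ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + K * (sa - ea)
    ratings[model_b] = rb + K * ((1.0 - sa) - (1.0 - ea))

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model:24s} {rating:7.1f}")
```

With the full public data plugged in, the same loop could be filtered (by prompt category, response length, formatting, etc.) to see how much superficial qualities shift the rankings.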


I don't think there is too much random clicking. There is legitimacy to it.

I also think a small portion of the data should be public. If an auditor wants, they can get a bigger portion of the data. LLM builders should not get all the data, that's for sure. I will try to do that for my leaderboard: a gradient of openness for different actors.

Where did you get that info about Qwen 3 from? I'm really curious now...
