etemiz 
posted an update 8 days ago
Qwen 3 numbers are in! They did a good job this time: compared to Qwen 2.5 and QwQ, the numbers are a lot better.

I used 2 GGUFs for this, one from LMStudio and one from Unsloth. Number of parameters: 235B total, 22B active (235B A22B). The first one is Q4, the second one is Q8.

The LLMs that did the comparison are the same as before: Llama 3.1 70B and Gemma 3 27B.

So I took 2 * 2 = 4 measurements for each column and averaged them.

My leaderboard seems pretty uncorrelated with others. That makes it valuable in a sense: it offers another, non-mainstream angle on model evaluation.

More info: https://huggingface.co/blog/etemiz/aha-leaderboard

I think my leaderboard can be used to estimate p(doom)!

Let's say the highest scores, around 50, correspond to p(doom) = 0.1,
and the lowest scores, around 20, correspond to p(doom) = 0.5.

The last three models I measured are Grok 3, Llama 4 Maverick, and Qwen 3, with scores of 42, 45, and 41. So the average of the last 3 measurements is 42.66. Mapping this onto the 20-50 scale above:

(50 - 42.66) / (50 - 20) ≈ 0.24

Mapping this fraction into the probability range:

(0.5 - 0.1) * 0.24 + 0.1 ≈ 0.196

So the probability of doom is ~20%.
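The mapping above is just a linear interpolation between two anchor points. A minimal Python sketch (the anchor scores 50 and 20 and their p(doom) values 0.1 and 0.5 are the assumptions stated above; the function name is mine):

```python
# Anchor points assumed in the post: score 50 -> p(doom) 0.1, score 20 -> 0.5.
HIGH_SCORE, LOW_SCORE = 50.0, 20.0
P_AT_HIGH, P_AT_LOW = 0.1, 0.5

def p_doom(scores):
    """Linearly map the average of recent leaderboard scores to p(doom)."""
    avg = sum(scores) / len(scores)
    # Fraction of the way from the high-score anchor toward the low-score anchor
    t = (HIGH_SCORE - avg) / (HIGH_SCORE - LOW_SCORE)
    # Interpolate between the two anchor probabilities
    return P_AT_LOW * t + P_AT_HIGH * (1 - t)

# Last three scores from the post: Grok 3, Llama 4 Maverick, Qwen 3
print(round(p_doom([42, 45, 41]), 3))  # → 0.198
```

This prints 0.198 rather than exactly 0.196 because the post rounds the fraction to 0.24 before the second step; either way it lands at roughly 20%.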

If models are released that score high on my leaderboard, p(doom) will decrease. If models are released that score low, p(doom) will increase.
