Any news on the awesome new benchmarks?
Hey there, hope you're having a good week.
Been thinking about your plans for the new benchmarks since you posted, and man, a proper writing/reasoning test sounds like a game-changer. Just popping in to see how it's all coming along?
Totally get that you're probably swamped, so no rush at all, but any chance you'd be able to share a little sneak peek of what you've got so far? We're all super curious to see what you're cooking up.
Oh, and one other thing I was wondering: once it's all ready to go, are you thinking of running the older models through the new tests too?
Anyway, keep up the amazing work. What you're building is seriously valuable for the community.
for real! I have been refreshing this every day hoping to see what has been cooking
I've been putting a lot of time into just trying to get vLLM to run faster. I'm kinda stuck using unquantized bfloat16 models, since every quantization method I try either doesn't have good model support, needs a specific GPU to be worth it, or is just slower than bfloat16.
So running larger models has become more difficult than when I just used GGUFs through llama.cpp, but I wouldn't be able to do the quantity of prompts I can now without vLLM's batching.
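Roughly what the vLLM side looks like, as a minimal sketch; the model name, prompts, and sampling settings here are placeholders, not my actual setup:

```python
# Minimal sketch of batched offline inference with vLLM (unquantized bfloat16).
from vllm import LLM, SamplingParams

# Placeholder prompts standing in for a real benchmark batch.
prompts = [
    "Estimate the weight of a standard house brick in kilograms.",
    "Predict how many views a cooking tutorial from a brand-new channel gets in its first week.",
]

# dtype="bfloat16" keeps the weights unquantized; vLLM batches the prompts
# internally (continuous batching), which is where the throughput win comes from.
llm = LLM(model="some-org/some-model", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=256)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```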
When I was brainstorming tasks to give LLMs to measure reasoning capabilities, I decided they needed to be basically impossible to get a perfect score on, so hard that it's more a question of how close the model gets rather than whether it got it correct. Over time, as models get smarter, they'll know pretty much any trivia question I ask them, so I'll probably move most questions over to this format.
Some examples: "Here is a list of TV shows I've seen and the ratings out of 10 I've given each. Now here is another list of shows; predict what I will rate each." Or a form of GeoGuessr where I give a description of a location and the LLM has to use context clues to guess where I'm talking about (and I record how far off it was). Or seeing which AI is the best at cooking by leaving out parts of a recipe and having the model fill in things like how many teaspoons of an ingredient to use.
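For the GeoGuessr-style one, "how far off" is just great-circle distance between the true spot and the guess. A quick sketch (the city pair is a made-up example, and parsing lat/lon out of the model's answer is a separate step I'm glossing over):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical example: true location vs. the model's guess.
error_km = haversine_km(48.8566, 2.3522, 50.0755, 14.4378)  # Paris vs. Prague
print(f"model was off by {error_km:.0f} km")
```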
I've found, though, that reasoning models don't always do well at these kinds of questions. They're trained more on things like math, coding, and logic puzzles, so I've decided to make these questions their own category, probably called the AI's world model. It's kind of like the model's practical common sense: its understanding of how things relate to each other. Questions like "estimate how much this object weighs" or "how many views will this YouTube video get". A lot of proprietary models do badly at some of these because their ingrained helpfulness and positivity makes them predict things like YouTube videos getting more views than they actually end up getting.

I'll make sure each of these benchmarks is available by itself too, since knowing the best AI for show recommendations or cooking ability is more useful than just having a vague "intelligence" score.
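For those estimation questions, raw error doesn't compare well across magnitudes (being off by 10k views means something very different at 5k views vs. 5M), so something like log-scale error is the natural fit. A sketch, not necessarily the exact scoring I'll ship:

```python
import math

def log_error(predicted, actual):
    """Order-of-magnitude error: 0 means exact, 1 means off by 10x.
    Assumes both values are positive."""
    return abs(math.log10(predicted) - math.log10(actual))

# Hypothetical example: model predicts 120k views, video actually got 8k.
print(round(log_error(120_000, 8_000), 2))  # ~1.18, i.e. off by a bit over 10x
```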
Planning on having three segments that make up a model's intelligence: 1. standard intelligence questions like math, textbook knowledge, and logic; 2. pop culture questions; 3. world model questions. Models like QwQ 32B and gpt-oss do well in the first category, but not so much in the other two.
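Assuming each segment produces a normalized 0-100 score, the combined number could be as simple as a weighted average; the weights below are completely made up, nothing is fixed yet:

```python
# Hypothetical weighting of the three segments into one intelligence score.
def intelligence_score(standard: float, pop_culture: float, world_model: float) -> float:
    weights = {"standard": 0.4, "pop_culture": 0.3, "world_model": 0.3}
    return (weights["standard"] * standard
            + weights["pop_culture"] * pop_culture
            + weights["world_model"] * world_model)

print(round(intelligence_score(82.0, 55.0, 60.0), 1))  # 67.3
```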
So yeah. I've kinda given up on getting the program to run any more efficiently and am now just gonna be testing a bunch of models before release. Model testing isn't super fast, so I won't be able to test all the models I currently have on the leaderboard. I'll probably also try to implement a more organized model submission feature.
When will the leaderboard reopen?
I've benchmarked around one hundred models under the new system so far, and I'd probably like that to be around three hundred when I reopen. I'll also need to recode the leaderboard display to handle all the new information. So maybe a couple weeks.
make sure to rebench a lot of the old ones so we know how the scores shift and why.