singhsidhukuldeep
posted an update May 16
🎉 A new LLM is launched! 🚀
After checking if it's open-source or not, 🤔
you rush to see the benchmarks... 🏃‍♂️💨

Which benchmark does everyone check first? 🔍

MMLU (Massive Multitask Language Understanding)? 📚

Benchmarks like MMLU are reaching saturation, and most of the time the performance does not translate to real-world use cases! 🌍❗

Meet MMLU-Pro, released by TIGER-Lab on @huggingface ! 🐯🌍

🧪 12,217 questions across biology, business, chemistry, computer science, economics, engineering, health, history, law, mathematics, philosophy, physics, and psychology, carefully validated by humans 🧑‍🔬

🔟 Goes to 10 options per question instead of 4; the extra options make the evaluation more realistic and reduce the payoff from random guessing 🎯

📊 56% of questions come from MMLU, 34% from STEM websites, and the rest from TheoremQA and SciBench 📈

🤖 LLMs with weak chain-of-thought reasoning tend to score lower, indicating the benchmark is more challenging and more representative of real-world expectations 🧠💡
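
To put a number on the "reduce random guessing" point, here is a minimal Python sketch of the chance baseline. It assumes every question has exactly the stated number of options and a single correct answer, with guesses drawn uniformly; `chance_accuracy` is a hypothetical helper name, not part of any MMLU-Pro tooling.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on single-answer questions."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


# Assumption: 4 options per MMLU question, 10 per MMLU-Pro question.
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"MMLU: {float(mmlu_chance):.0%}, MMLU-Pro: {float(mmlu_pro_chance):.0%}")
```

Under these assumptions the floor a model gets "for free" falls from a quarter of the questions to a tenth, so a given score on MMLU-Pro reflects less luck.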

Any guess who tops it and who bombs it? 🤔📉📈

GPT-4o drops by 17 percentage points (from 0.887 to 0.7149) 📉
Llama-3-70B drops by 27 percentage points (from 0.820 to 0.5541) 📉
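
The drops quoted above are differences between the two scores, which is not the same as a relative decline. A small Python sketch, using only the four scores quoted in this post, computes both; `score_drop` is a hypothetical helper name chosen for illustration.

```python
def score_drop(before: float, after: float) -> tuple[float, float]:
    """Return (drop in percentage points, drop relative to the earlier score in %)."""
    points = (before - after) * 100
    relative = (before - after) / before * 100
    return points, relative


# (MMLU score, MMLU-Pro score) as quoted in the post.
scores = {
    "GPT-4o": (0.887, 0.7149),
    "Llama-3-70B": (0.820, 0.5541),
}

for model, (mmlu, mmlu_pro) in scores.items():
    points, relative = score_drop(mmlu, mmlu_pro)
    print(f"{model}: -{points:.1f} points ({relative:.1f}% relative to MMLU)")
```

Measured relative to the MMLU score, both declines come out larger than the point differences, and the gap between the two models widens.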

🔗 TIGER-Lab/MMLU-Pro

very cool! cc @clefourrier

Great post!