🥇 MMLU-Pro Leaderboard
More advanced and challenging multi-task evaluation
Other evaluation Spaces:
- Benchmarking LLMs on the stability of simulated populations
- Embed and use ZeroEval for evaluation tasks (see the sketch after this list)
- Display model leaderboard evaluations
- Browse and submit LLM evaluations
- VLMEvalKit evaluation results on video understanding benchmarks
- Track, rank and evaluate open LLMs and chatbots
- Blind vote on HF TTS models!
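The entries above are Hugging Face Spaces, and Gradio-based Spaces like these can also be queried programmatically instead of through the web UI. A minimal sketch, assuming the `gradio_client` package and a placeholder Space ID and endpoint (the actual ZeroEval or leaderboard Space IDs are not given here):

```python
# Minimal sketch: calling a Gradio-based Hugging Face Space programmatically.
# "owner/space-name" and "/predict" are placeholders, not a real Space or endpoint.
from gradio_client import Client

client = Client("owner/space-name")  # placeholder Space ID (assumption)
client.view_api()                    # prints the Space's callable endpoints

# Once you know the endpoint name and its inputs, call it directly:
# result = client.predict("your input here", api_name="/predict")  # placeholder endpoint
# print(result)
```

Checking `view_api()` first avoids guessing at argument order, since each Space defines its own endpoints and input types.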