Open LLM Leaderboard 2
Track, rank and evaluate open LLMs and chatbots
A cool collection of leaderboard spaces for models across modalities: text, vision, audio, and more!
Track, rank and evaluate open LLMs and chatbots
Note The reference leaderboard for Open LLMs! Find the best LLM for your size and precision needs, and compare your models to others! (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Note Specialized leaderboard for models with coding capabilities (Evaluates on HumanEval and MultiPL-E)
Note Pitches chatbots against one another to compare their output quality (Evaluates on MT-Bench, an Elo score computed from pairwise battles, and MMLU)
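The Elo score here comes from head-to-head model battles. As a minimal sketch (the K-factor and 400-point scale below are the conventional chess defaults, not values taken from this leaderboard), one rating update looks like this:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two Elo ratings after one battle.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    The K-factor and 400-point scale are conventional defaults (assumptions),
    shown only to illustrate how arena-style rankings are maintained.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    score_b = 1.0 - score_a
    return (
        rating_a + k * (score_a - expected_a),
        rating_b + k * (score_b - expected_b),
    )


# Example: two equally rated models, A wins -> A gains 16 points, B loses 16.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```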
Note Do you want to know which model to use for which hardware? This leaderboard is for you! (Looks at the throughput of many LLMs in different hardware settings)
Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
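If you want to run the same kind of evaluations locally, recent releases of the harness expose a Python entry point. A minimal sketch, assuming the `lm_eval.simple_evaluate` API and a placeholder Hugging Face model id:

```python
# Minimal sketch: evaluating a Hugging Face model with the EleutherAI
# lm-evaluation-harness. Assumes a recent release exposing lm_eval.simple_evaluate;
# the model id and task list are illustrative placeholders, not recommendations.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # any hub model id
    tasks=["hellaswag", "arc_challenge"],                # tasks registered in the harness
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, ...) live under results["results"].
print(results["results"])
```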
Note The HELM paper! A super cool reference paper on the many axes to look at when creating an LLM benchmark or evaluation suite. Super exhaustive and interesting to read.
Note The BigBench paper! A bunch of tasks to evaluate edge cases and random unusual LLM capabilities. The associated benchmark has since been extended with a lot of fun crowdsourced tasks.
Note Text Embeddings benchmark across 58 tasks and 112 languages!
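To score your own embedding model against this benchmark (MTEB), the `mteb` package can run individual tasks. A minimal sketch, assuming the `MTEB` runner from the package's quick-start and an illustrative sentence-transformers model:

```python
# Minimal sketch: scoring an embedding model on one MTEB task.
# Assumes the mteb and sentence-transformers packages; the model name and
# task choice are illustrative placeholders.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```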
Note A leaderboard for tool-augmented LLMs!
Note An LLM leaderboard for Chinese models on many metric axes - very comprehensive
Note An Open LLM Leaderboard specifically for Korean models by our friends at Upstage!
Note An Open LLM Leaderboard specifically for Dutch models!
Note A leaderboard to evaluate the propensity of LLMs to hallucinate
Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Note Tests LLM API usage and calls (few models at the moment)
Note How likely is your LLM to help with cyberattacks?
Note An aggregation of benchmarks well correlated with human preferences
Note Bias, safety, toxicity, all those things that are important to test when your chatbot actually interacts with users
Note Text-to-video generation leaderboard
Note Coding benchmark
Note An OCR benchmark
Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Note Measures the success of red-teaming datasets against models
Note The Open LLM Leaderboard, but for structured state space models!
Note A multimodal arena!
Track, rank and evaluate open LLMs in Portuguese
Note An LLM leaderboard for Portuguese
Track, rank and evaluate open LLMs in the Italian language!
Note An LLM leaderboard for Italian
Note An LLM leaderboard for Malay
Realtime Image/Video Gen AI Arena
Note An arena for image generation!
Note A hallucination leaderboard, focused on a different set of tasks
VLMEvalKit Evaluation Results Collection
Vote on the latest TTS models!
Track, rank and evaluate open LLMs' CoT quality
Leaderboard for LLMs on science reasoning
Track, rank and evaluate open LLMs and chatbots
Track, rank and evaluate open Arabic LLMs and chatbots
Compact LLM Battle Arena: Frugal AI Face-Off!
Evaluate open LLMs in the languages of LATAM and Spain.
GIFT-Eval: A Benchmark for General Time Series Forecasting