Open LLM Leaderboard 2
Track, rank and evaluate open LLMs and chatbots
A cool collection of leaderboard spaces for models across modalities: text, vision, audio, and more!
Track, rank and evaluate open LLMs and chatbots
Note The reference leaderboard for Open LLMs! Find the best LLM for your size and precision needs, and compare your models to others! (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Note Specialized leaderboard for models with coding capabilities (Evaluates on HumanEval and MultiPL-E)
Note Pitches chatbots against one another to compare their output quality (Evaluates on MT-Bench, an Elo score computed from pairwise battles, and MMLU)
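The Elo score here comes from head-to-head model battles. As a minimal sketch (the K-factor and 400-point scale below are the conventional chess defaults, not values taken from this leaderboard), one rating update looks like this:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two Elo ratings after one battle.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    The K-factor and 400-point scale are conventional defaults (assumptions),
    shown only to illustrate how arena-style rankings are maintained.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    score_b = 1.0 - score_a
    return (
        rating_a + k * (score_a - expected_a),
        rating_b + k * (score_b - expected_b),
    )


# Example: two equally rated models, A wins -> A gains 16 points, B loses 16.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```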
Note Do you want to know which model to use for which hardware? This leaderboard is for you! (Looks at the throughput of many LLMs in different hardware settings)
Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
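If you want to run the same kind of evaluations locally, recent releases of the harness expose a Python entry point. A minimal sketch, assuming the `lm_eval.simple_evaluate` API and a placeholder Hugging Face model id:

```python
# Minimal sketch: evaluating a Hugging Face model with the EleutherAI
# lm-evaluation-harness. Assumes a recent release exposing lm_eval.simple_evaluate;
# the model id and task list are illustrative placeholders, not recommendations.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # any hub model id
    tasks=["hellaswag", "arc_challenge"],                # tasks registered in the harness
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, ...) live under results["results"].
print(results["results"])
```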
Note The HELM paper! A super cool reference paper on the many axes to look at when creating an LLM benchmark or evaluation suite. Super exhaustive and interesting to read.
Note The BigBench paper! A bunch of tasks to evaluate edge cases and random unusual LLM capabilities. The associated benchmark has since been extended with a lot of fun crowdsourced tasks.
Note Text Embeddings benchmark across 58 tasks and 112 languages!
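To score your own embedding model against this benchmark (MTEB), the `mteb` package can run individual tasks. A minimal sketch, assuming the `MTEB` runner from the package's quick-start and an illustrative sentence-transformers model:

```python
# Minimal sketch: scoring an embedding model on one MTEB task.
# Assumes the mteb and sentence-transformers packages; the model name and
# task choice are illustrative placeholders.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```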
Note A leaderboard for tool-augmented LLMs!
Note An LLM leaderboard for Chinese models on many metric axes - very comprehensive
Note An Open LLM Leaderboard specifically for Korean models by our friends at Upstage!
Note An Open LLM Leaderboard specifically for Dutch models!
Note A leaderboard to evaluate the propensity of LLMs to hallucinate
Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Note Tests LLM API usage and calls (few models at the moment)
Note How likely is your LLM to help with cyberattacks?
Note An aggregation of benchmarks well correlated with human preferences
Note Bias, safety, toxicity, all those things that are important to test when your chatbot actually interacts with users
Note Text-to-video generation leaderboard
Note Coding benchmark
Note An OCR benchmark
Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Note Measures the success of red-teaming datasets against models
Note The Open LLM Leaderboard, but for structured state space models!
Note A multimodal arena!
Track, rank and evaluate open LLMs in Portuguese
Note An LLM leaderboard for Portuguese
Track, rank and evaluate open LLMs in the Italian language!
Note An LLM leaderboard for Italian
Note An LLM leaderboard for Malay
Realtime Image/Video Gen AI Arena
Note An arena for image generation!
Note A hallucination leaderboard, focused on a different set of tasks
VLMEvalKit Evaluation Results Collection
Vote on the latest TTS models!
Track, rank and evaluate open LLMs' CoT quality
Leaderboard for LLMs on science reasoning
Track, rank and evaluate open LLMs and chatbots
Track, rank and evaluate open Arabic LLMs and chatbots
Compact LLM Battle Arena: Frugal AI Face-Off!
Evaluate open LLMs in the languages of LATAM and Spain.
GIFT-Eval: A Benchmark for General Time Series Forecasting