📊 Benchmarks and Leaderboards - a society-ethics Collection

updated Sep 26, 2024

Running on CPU Upgrade

13.2k

Open LLM Leaderboard

🏆

Track, rank and evaluate open LLMs and chatbots
Runtime error

5

Zeno Evals Hub

🏃
Running on CPU Upgrade

5.87k

MTEB Leaderboard

🥇

Embedding Leaderboard
Running

516

LLM-Perf Leaderboard

🏆

Explore LLM performance across hardware
Runtime error

135

Leaderboards

📈
Running on CPU Upgrade

876

Open ASR Leaderboard

🏆

Request evaluation for a speech model
Running

1.35k

Big Code Models Leaderboard

📈

Search and submit code models for evaluation
Running

4.47k

Chatbot Arena Leaderboard

🏆

Display chatbot leaderboard and stats
Running

162

Open Object Detection Leaderboard

🏆

Request model evaluation on COCO val 2017 dataset
Running

67

Toolbench Leaderboard

⚡

Display ToolBench model performance results
Running

85

SEED-Bench Leaderboard

🏆
Running

95

OpenCompass LLM Leaderboard

🚀

Display a web page

nguha/legalbench

Updated Sep 30, 2024 • 9.54k • 124
Running

6

Skillmix

🚀

Browse and compare AI model evaluations
Running on CPU Upgrade

140

Hallucinations Leaderboard

🔥

View and submit LLM evaluations
Running

38

MVBench Leaderboard

🐨

Submit model evaluation and view leaderboard
Running

3

Mt Bench French Browser

📊
Running

8

ML.ENERGY Leaderboard

⚡

Explore energy consumption of GenAI models
Running

53

NPHardEval Leaderboard

🥇

Explore and compare LLM models through a leaderboard
Running

287

VBench Leaderboard

📊

Upload and analyze video model evaluation data
Runtime error

105

Enterprise Scenarios Leaderboard

🥇
Running

189

Yet Another LLM Leaderboard

🌖

Run a Streamlit web app
Running

66

CyberSecEvalTest

📈

Evaluate LLM cybersecurity risks
Runtime error

30

Contextual Leaderboard

🐨
Running

56

Open Multilingual Llm Leaderboard

🐨

Search for model performance across languages and benchmarks
Running on CPU Upgrade

91

OpenLLM Turkish leaderboard

🥇

Browse and filter leaderboard of language models
Running on CPU Upgrade

794

Open VLM Leaderboard

🌎

VLMEvalKit Evaluation Results Collection
Running

379

Reward Bench Leaderboard

📐

Display and filter reward model evaluation data
Runtime error

63

Guardrails Arena

⚔

Jailbreak the LLM and privacy guardrails
Running

16

🐍💨 Data Contamination Database

🏭

Filter data for contamination in datasets or models
Running on CPU Upgrade

153

Open Arabic LLM Leaderboard

🏆

Track, rank and evaluate open Arabic LLMs and chatbots
Running on CPU Upgrade

72

AIR-Bench Leaderboard

🥇

Explore and compare QA and long doc benchmarks
Running

23

MM-UPD Leaderboard

🥇

Submit and evaluate model results for the MM-AAD leaderboard
Running

212

BigCodeBench Leaderboard

🥇

Explore and analyze code evaluation data
Running on CPU Upgrade

73

La Leaderboard

🌸

Evaluate open LLMs in the languages of LATAM and Spain.