OpenEvals
AI & ML interests
LLM evaluation
A small overview of our research collabs through the years
- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 236
- Zephyr: Direct Distillation of LM Alignment
  Paper • 2310.16944 • Published • 122
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
  Paper • 2502.02737 • Published • 243
- Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
  Paper • 2412.03304 • Published • 21
The original Open LLM Leaderboard (now archived) evaluated 7K LLMs from April 2023 to June 2024 on ARC-c, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.
- Find a leaderboard (110)
  Explore and discover all leaderboards from the HF community
- YourBench (40)
  Generate custom evaluations from your data easily!
- Example Leaderboard Template (16)
  Duplicate this leaderboard to initialize your own!
- Run your LLM evaluations on the hub
  Generate a command to run model evaluations
The current Open LLM Leaderboard has been evaluating LLMs since June 2024 on IFEval, MuSR, GPQA, MATH, BBH, and MMLU-Pro; a sketch for running these tasks locally follows the list below.
- Open-LLM performances are plateauing, let's make the leaderboard steep again (124)
  Explore and compare advanced language models on a new leaderboard
- Open LLM Leaderboard (13.6k)
  Track, rank and evaluate open LLMs and chatbots
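The spaces above wrap the evaluation pipeline behind a UI, but the same task suite can also be run locally. A minimal sketch, assuming EleutherAI's lm-evaluation-harness (pip install lm-eval) and its "leaderboard" task group; the model name is only an example, and exact task names and scores depend on the installed harness version, so treat this as a starting point rather than the leaderboard's official pipeline:

```python
# Minimal sketch, not the leaderboard's official pipeline: run the v2 task
# suite (IFEval, MuSR, GPQA, MATH, BBH, MMLU-Pro) locally with
# lm-evaluation-harness. The "leaderboard" task group and the example model
# are assumptions -- check the task registry of your installed version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                          # Hugging Face transformers backend
    model_args="pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct",  # example model
    tasks=["leaderboard"],               # grouped Open LLM Leaderboard v2 tasks
    batch_size="auto",
)

# Per-task aggregate metrics (accuracy, exact match, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```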
- open-llm-leaderboard/contents
  Viewer • Updated • 4.58k • 9.42k • 20
- open-llm-leaderboard/results
  Preview • Updated • 7.41k • 15
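These datasets hold the aggregated leaderboard table and the raw per-model results. A minimal sketch of pulling the contents table with the datasets library; the split name and schema are assumptions, so check the dataset viewer for the actual columns:

```python
# Minimal sketch: load the aggregated leaderboard table. The "train" split
# name is an assumption -- check the dataset card/viewer for the real schema.
from datasets import load_dataset

contents = load_dataset("open-llm-leaderboard/contents", split="train")
print(contents.column_names)   # discover the available columns
print(contents[0])             # one row = one evaluated model
```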