A cool collection of leaderboard spaces for models across modalities! Text, vision, audio, ...
Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Note The reference leaderboard for Open LLMs! Find the best LLM for your size and precision needs, and compare your models to others! (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Submit code models for evaluation and view leaderboard
Note Specialized leaderboard for models with coding capabilities (Evaluates on HumanEval and MultiPL-E)
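Coding leaderboards like this one generally report pass@k. As a refresher, here is a minimal sketch of the standard unbiased pass@k estimator from the HumanEval (Codex) paper; the sample counts in the usage line are made up:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # fewer than k failures: every k-subset contains a pass
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Hypothetical numbers: 200 samples per problem, 37 passing, report pass@10
print(f"pass@10 = {pass_at_k(n=200, c=37, k=10):.3f}")
```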
Display LMArena Leaderboard
Note Pits chatbots against one another to compare their output quality (Reports MT-Bench, an Elo score from human votes, and MMLU)
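For intuition on the Elo part, here is a minimal sketch of a single rating update after one human vote; the starting ratings and K-factor are illustrative, and the arena's real rating computation is more involved than this:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise comparison.
    score_a is 1.0 if model A's answer wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1000; model A wins one vote
print(elo_update(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```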
Explore hardware performance for LLMs
Note Do you want to know which model to use on which hardware? This leaderboard is for you! (Measures the throughput of many LLMs across different hardware settings)
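To make "throughput" concrete, here is a hypothetical micro-benchmark measuring decode speed in tokens per second with transformers; the model id and generation settings are placeholders, not the leaderboard's actual protocol:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model, small enough to run anywhere
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```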
Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
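For reference, running the harness yourself looks roughly like this, assuming the v0.4-style lm_eval Python API (the lm_eval CLI is equivalent); the model and task names are just examples:

```python
import lm_eval

# Evaluate a Hub model on two harness tasks, zero-shot
results = lm_eval.simple_evaluate(
    model="hf",                    # transformers backend
    model_args="pretrained=gpt2",  # any Hub model id
    tasks=["hellaswag", "arc_challenge"],
    num_fewshot=0,
)
print(results["results"])
```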
Note The HELM paper! A super cool reference paper on the many axes to consider when creating an LLM benchmark or evaluation suite. Exhaustive and interesting to read.
Note The BigBench paper! A bunch of tasks to evaluate edge cases and unusual LLM capabilities. The associated benchmark has since been extended with a lot of fun crowdsourced tasks.
Embedding Leaderboard
Note Text Embeddings benchmark across 58 tasks and 112 languages!
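Running MTEB on your own embedding model looks roughly like this, assuming the mteb Python package and a sentence-transformers model; the task and model names are just examples:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])  # a single task, for speed
evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```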
Submit and evaluate models on GAIA leaderboard
Note A leaderboard for tool-augmented LLMs!
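Loading the GAIA questions locally looks roughly like this, assuming you have accepted the gated-dataset terms on the Hub and are logged in; the config and column names below follow the dataset card and may change:

```python
from datasets import load_dataset

# "2023_level1" is one of the configs listed on the dataset card
gaia = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation")
print(gaia[0]["Question"])
```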
Display a web page
Note An LLM leaderboard for Chinese models on many metric axes - very comprehensive
Explore and filter language model benchmark results
Note An Open LLM Leaderboard specifically for Korean models by our friends at Upstage!
Redirect to leaderboard page
Note A leaderboard evaluating the propensity of LLMs to hallucinate
View and submit LLM evaluations
Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Display benchmark results for models on various tasks
Note Tests LLM API usage and calls (few models at the moment)
Evaluate LLMs' cybersecurity risks and capabilities
Note How likely is your LLM to assist in cyberattacks?
Generate interactive web apps with Streamlit
Note An aggregation of benchmarks well correlated with human preferences
Explore and submit LLM benchmarks
Note Bias, safety, toxicity, all those things that are important to test when your chatbot actually interacts with users
Display and filter video generation model leaderboard
Note Text-to-video generation leaderboard
Can AI Code? An LLM leaderboard including quantized models.
Note Coding benchmark
Display OCRBench leaderboard with model scores
Note An OCR benchmark
Explore and filter LLM benchmark results
Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Submit models for evaluation on a leaderboard
Note The Open LLM Leaderboard, but for structured state models!
Display image analysis results
Note A multimodal arena!
Upload and evaluate video models
Track, rank and evaluate open LLMs in Portuguese
Note An LLM leaderboard for Portuguese
Track, rank and evaluate open LLMs in the Italian language!
Note An LLM leaderboard for Italian
Realtime Image/Video Gen AI Arena
Note An arena for image generation!
View leaderboard results for Q-Bench
Display leaderboard for text-to-image model evaluations
Generate visual data analysis plots
Note A hallucination leaderboard, focused on a different set of tasks
Display and filter LLM benchmark results
Explore and submit LLM benchmarks
Display and explore a leaderboard of language models
View and request speech recognition model benchmarks
VLMEvalKit Evaluation Results Collection
Display and analyze reward model evaluation results
Vote on the latest TTS models!
Check for prompt injection in text
View and compare leaderboard results for coding tasks
Explore energy consumption of GenAI models
Uncensored General Intelligence Leaderboard
Display Berkeley Function-Calling Leaderboard
Track, rank and evaluate open LLMs' CoT quality
Display a leaderboard of models
Explore and compare Indic LLMs on a leaderboard
Leaderboard for LLM for Science Reasoning
Explore and submit models for benchmarking
Display and filter leaderboard data for language models
Explore and submit LLM benchmarks
Visualize Open vs. Proprietary LLM Progress
Track, rank and evaluate open LLMs and chatbots
Explore and compare QA and long doc benchmarks
Track, rank and evaluate open Arabic LLMs and chatbots
Explore and submit LLM benchmarks
Vote and view 3D leaderboard
Explore and analyze code completion benchmarks
Explore and submit LLM benchmarks
Render a leaderboard for model evaluation
Explore multilingual LLM benchmark results
Submit and track model performance on a leaderboard
Browse and submit evaluation results for AI benchmarks
Benchmarking LLMs on the stability of simulated populations
Evaluate open LLMs in the languages of LATAM and Spain.
Explore and submit LLM benchmarks
GIFT-Eval: A Benchmark for General Time Series Forecasting
Vote on AI responses to rank models
Open Persian LLM Leaderboard
Compare two chatbots and vote on the better one
Explore and compare LLM models with interactive filters and visualizations
Submit models to MLSB 2024 leaderboard
Explore toxicity scores of models
Vote for the best background removal model
Forecasting evaluation benchmark
AI Phone Leaderboard
Display and analyze model benchmark results
Display and filter LLM benchmark results
Explore Polish text understanding benchmark results
Browse and evaluate model answers and comparisons
DABstep Reasoning Benchmark Leaderboard