Open LLM Leaderboard: Track, rank and evaluate open LLMs and chatbots • 12.2k
WebArena: A Realistic Web Environment for Building Autonomous Agents Paper • 2307.13854 • Published Jul 25, 2023 • 24
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java Paper • 2408.14354 • Published Aug 26, 2024 • 41
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments Paper • 2404.07972 • Published Apr 11, 2024 • 47
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks Paper • 2412.14161 • Published about 1 month ago • 50
MiniMax-01: Scaling Foundation Models with Lightning Attention Paper • 2501.08313 • Published 4 days ago • 258