kaizuberbuehler's Collections
Benchmarks
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 187
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Paper • 2311.16502 • Published • 35
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 25
RULER: What's the Real Context Size of Your Long-Context Language Models?
Paper • 2404.06654 • Published • 35
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Paper • 2404.03820 • Published • 25
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
Paper • 2404.03543 • Published • 16
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Paper • 2404.16820 • Published • 16
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 8
On the Planning Abilities of Large Language Models -- A Critical Investigation
Paper • 2305.15771 • Published • 1
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Paper • 2405.21075 • Published • 22
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Paper • 2406.09170 • Published • 26
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Paper • 2406.09411 • Published • 19
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
Paper • 2406.08587 • Published • 15
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Paper • 2406.11833 • Published • 62
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Paper • 2406.09961 • Published • 55
Needle In A Multimodal Haystack
Paper • 2406.07230 • Published • 53
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
Paper • 2406.10149 • Published • 49
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Paper • 2407.18961 • Published • 40
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
Paper • 2407.18901 • Published • 33
WebArena: A Realistic Web Environment for Building Autonomous Agents
Paper • 2307.13854 • Published • 24
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
Paper • 2408.03361 • Published • 86
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Paper • 2408.14354 • Published • 41
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
Paper • 2405.07960 • Published • 1
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
Paper • 2310.16049 • Published • 4
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Paper • 2409.12959 • Published • 37
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?
Paper • 2409.07703 • Published • 67
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
Paper • 2409.16191 • Published • 42
OmniBench: Towards The Future of Universal Omni-Language Models
Paper • 2409.15272 • Published • 27
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Paper • 2410.07985 • Published • 28
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Paper • 2404.07972 • Published • 47
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Paper • 2412.15204 • Published • 33
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Paper • 2412.14161 • Published • 50
Are Your LLMs Capable of Stable Reasoning?
Paper • 2412.13147 • Published • 91
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Paper • 2412.08737 • Published • 53
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Paper • 2501.01257 • Published • 47
A3: Android Agent Arena for Mobile GUI Agents
Paper • 2501.01149 • Published • 22
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
Paper • 2412.21199 • Published • 12
ResearchTown: Simulator of Human Research Community
Paper • 2412.17767 • Published • 13
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Paper • 2412.14470 • Published • 12
The BrowserGym Ecosystem for Web Agent Research
Paper • 2412.05467 • Published • 19
Evaluating Language Models as Synthetic Data Generators
Paper • 2412.03679 • Published • 46
Paper • 2412.04315 • Published • 17
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
Paper • 2412.03205 • Published • 16
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Paper • 2411.06176 • Published • 45
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
Paper • 2411.04075 • Published • 16
From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
Paper • 2411.03590 • Published • 10
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Paper • 2501.04686 • Published • 48
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
Paper • 2310.11667 • Published • 3
PokerBench: Training Large Language Models to become Professional Poker Players
Paper • 2501.08328 • Published • 13
WebWalker: Benchmarking LLMs in Web Traversal
Paper • 2501.07572 • Published • 18
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Paper • 2501.05510 • Published • 35
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Paper • 2501.04003 • Published • 23
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper • 2501.08828 • Published • 26
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Paper • 2501.09012 • Published • 10