StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs Paper • 2505.20139 • Published May 26 • 18
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research Paper • 2505.19955 • Published May 26 • 12
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design Paper • 2505.16175 • Published May 22 • 42
General-Reasoner: Advancing LLM Reasoning Across All Domains Paper • 2505.14652 • Published May 20 • 23
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation Paper • 2504.00043 • Published Mar 30 • 10
Small Models Struggle to Learn from Strong Reasoners Paper • 2502.12143 • Published Feb 17 • 40
ACECODER: Acing Coder RL via Automated Test-Case Synthesis Paper • 2502.01718 • Published Feb 3 • 29
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning Paper • 2502.01100 • Published Feb 3 • 18
Vision Arena (Testing VLMs side-by-side) Space • Running • 556 • Analyze images to detect and label objects
On Memorization of Large Language Models in Logical Reasoning Paper • 2410.23123 • Published Oct 30, 2024 • 18
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Paper • 2410.10563 • Published Oct 14, 2024 • 39
Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies Paper • 2308.03188 • Published Aug 6, 2023 • 2
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction Paper • 2305.13903 • Published May 23, 2023
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings Paper • 2305.02317 • Published May 3, 2023