AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant Paper • 2410.18603 • Published Oct 24, 2024 • 33
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Paper • 2505.19897 • Published May 26 • 103
xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations Paper • 2506.13651 • Published Jun 16 • 9
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents Paper • 2507.19478 • Published 26 days ago • 29
SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? Paper • 2507.12415 • Published Jul 16 • 41
ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention Paper • 2507.01004 • Published Jul 1 • 10
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling Paper • 2506.20512 • Published Jun 25 • 46
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective Paper • 2506.14965 • Published Jun 17 • 49
General-Reasoner: Advancing LLM Reasoning Across All Domains Paper • 2505.14652 • Published May 20 • 23
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Paper • 2505.13227 • Published May 19 • 46
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Paper • 2505.13227 • Published May 19 • 46
RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning Paper • 2504.18904 • Published Apr 26 • 9
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows Paper • 2411.07763 • Published Nov 12, 2024 • 2