LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model Paper • 2509.00676 • Published 9 days ago • 76
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge Paper • 2506.21506 • Published Jun 26 • 51
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories Paper • 2504.08942 • Published Apr 11 • 27
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills Paper • 2504.07079 • Published Apr 9 • 11
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents Paper • 2411.06559 • Published Nov 10, 2024 • 15
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents Paper • 2410.05243 • Published Oct 7, 2024 • 19
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions Paper • 2403.19651 • Published Mar 28, 2024 • 23
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI Paper • 2311.16502 • Published Nov 27, 2023 • 36