Reward Bench 2 Collection: Datasets, spaces, and models for the Reward Bench 2 benchmark and paper! • 11 items • Updated 12 days ago • 11
BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs Paper • 2505.19457 • Published 20 days ago • 61
Article TinyAgents: A Minimal Experiment with Code Agents and MCP Tools By albertvillanova • 30 days ago • 29
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially? Paper • 2503.12349 • Published Mar 16 • 41
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools Paper • 2503.10970 • Published Mar 14 • 17
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving Paper • 2502.16111 • Published Feb 22 • 9
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning Paper • 2502.14768 • Published Feb 20 • 48
Article NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates By lizhouf and 3 others • Feb 2, 2024 • 4
Article Agent Leaderboard: Evaluating AI Agents in Multi-Domain Scenarios By pratikbhavsar and 1 other • Feb 12 • 22
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling Paper • 2502.06703 • Published Feb 10 • 153
Efficient Tool Use with Chain-of-Abstraction Reasoning Paper • 2401.17464 • Published Jan 30, 2024 • 21
Training Language Model Agents without Modifying Language Models Paper • 2402.11359 • Published Feb 17, 2024 • 2
Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments Paper • 2402.14672 • Published Feb 22, 2024 • 1
Article π0 and π0-FAST: Vision-Language-Action Models for General Robot Control By danaaubakirova and 3 others • Feb 4 • 162
The Lessons of Developing Process Reward Models in Mathematical Reasoning Paper • 2501.07301 • Published Jan 13 • 99
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published Jan 10 • 66
DynaSaur: Large Language Agents Beyond Predefined Actions Paper • 2411.01747 • Published Nov 4, 2024 • 36