AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories Paper • 2504.08942 • Published Apr 11 • 27
How to Get Your LLM to Generate Challenging Problems for Evaluation Paper • 2502.14678 • Published Feb 20 • 18
Are NLP Models really able to Solve Simple Math Word Problems? Paper • 2103.07191 • Published Mar 12, 2021 • 1
Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions Paper • 2310.03016 • Published Oct 4, 2023 • 2