BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games Paper • 2411.13543 • Published Nov 20, 2024
When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options Paper • 2409.00113 • Published Aug 27, 2024