When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research • Paper • 2505.11855 • Published May 17, 2025
CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents • Paper • 2306.10376 • Published Jun 17, 2023
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! • Paper • 2410.01023 • Published Oct 1, 2024
Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics • Paper • 2406.14703 • Published Jun 20, 2024
VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms • Paper • 2503.14427 • Published Mar 18, 2025