Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering Paper • 2503.14996 • Published Mar 19 • 1
ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering Paper • 2410.05077 • Published Oct 7, 2024 • 3
Reward Bench 2 Collection Datasets, spaces, and models for Reward Bench 2 benchmark and paper! • 11 items • Updated Jun 3 • 13
CLIPPER Collection Models and datasets for CLIPPER: Compression enables long-context synthetic data generation • 6 items • Updated Feb 20 • 5