A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility Paper • 2504.07086 • Published Apr 9, 2025 • 21
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities Paper • 2412.06745 • Published Dec 9, 2024 • 6
Data Contamination Report from the 2024 CONDA Shared Task Paper • 2407.21530 • Published Jul 31, 2024 • 10
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance Paper • 2404.04125 • Published Apr 4, 2024 • 30
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress Paper • 2402.19472 • Published Feb 29, 2024 • 2
Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models Paper • 2310.08577 • Published Oct 12, 2023 • 1
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models Paper • 2211.16198 • Published Nov 28, 2022 • 1