A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility • arXiv:2504.07086 • Published Apr 9, 2025
A Practitioner's Guide to Continual Multimodal Pretraining • arXiv:2408.14471 • Published Aug 26, 2024
CiteME: Can Language Models Accurately Cite Scientific Claims? • arXiv:2407.12861 • Published Jul 10, 2024
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation • arXiv:2502.19414 • Published Feb 26, 2025
Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs • arXiv:2502.19413 • Published Feb 26, 2025
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities • arXiv:2412.06745 • Published Dec 9, 2024
Data Contamination Report from the 2024 CONDA Shared Task • arXiv:2407.21530 • Published Jul 31, 2024
Wu's Method Can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry • arXiv:2404.06405 • Published Apr 9, 2024
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance • arXiv:2404.04125 • Published Apr 4, 2024
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress • arXiv:2402.19472 • Published Feb 29, 2024
Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right? • arXiv:2305.09275 • Published May 16, 2023
Online Continual Learning Without the Storage Constraint • arXiv:2305.09253 • Published May 16, 2023