Benchmarking LLMs for Political Science: A United Nations Perspective • Paper 2502.14122 • Published Feb 19
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval • Paper 2503.04644 • Published Mar 6
Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation • Paper 2503.00812 • Published Mar 2
Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content • Paper 2503.16031 • Published Mar 20
FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents • Paper 2504.13128 • Published 10 days ago
Cost-of-Pass: An Economic Framework for Evaluating Language Models • Paper 2504.13359 • Published 10 days ago