Benchmarking LLMs for Political Science: A United Nations Perspective • Paper • 2502.14122 • Published 18 days ago • 2
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval • Paper • 2503.04644 • Published 4 days ago • 20
ExpertGenQA: Open-ended QA generation in Specialized Domains • Paper • 2503.02948 • Published 6 days ago
Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation • Paper • 2503.00812 • Published 8 days ago