FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents Paper • 2504.13128 • Published Apr 17, 2025
Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses Paper • 2504.20006 • Published Apr 28, 2025
Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval Paper • 2505.16967 • Published May 22, 2025
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems Paper • 2410.13716 • Published Oct 17, 2024
Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track Paper • 2406.16828 • Published Jun 24, 2024
Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard Paper • 2306.07471 • Published Jun 13, 2023
NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation Paper • 2312.11361 • Published Dec 18, 2023
HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution Paper • 2307.16883 • Published Jul 31, 2023
Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks Paper • 2010.08240 • Published Oct 16, 2020
GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval Paper • 2112.07577 • Published Dec 14, 2021
Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages Paper • 2210.09984 • Published Oct 18, 2022
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models Paper • 2104.08663 • Published Apr 17, 2021
Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval Paper • 2311.05800 • Published Nov 10, 2023