Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
Abstract
Using cascading LLM prompts to identify and relabel false negatives in datasets improves retrieval and reranking models' performance.
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from various sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where judgments by GPT-4o show much higher agreement with human annotators than those of GPT-4o-mini.
Community
Relabeling datasets for Information Retrieval improves NDCG@10 of both embedding models & cross-encoder rerankers. This was already the prevalent belief, but now it's been confirmed. Great job @nthakur, @crystina-z, @MrLight & @lintool
See the organization with datasets & models here: https://huggingface.co/rlhn
- Tom Aarsen
Did you know that fine-tuning retrievers & re-rankers on large but unclean training datasets can harm their performance? 💡
In our new preprint, we reexamine the quality of popular IR training data by pruning datasets and identifying and relabeling false negatives!
Preprint: https://arxiv.org/abs/2505.16967
Preliminary
We fine-tune E5 (base) on 16 retrieval datasets from the BGE collection (1.6M training pairs) and conduct a leave-one-out analysis: leaving one dataset out and fine-tuning on the rest. Surprisingly, removing ELI5 alone improves nDCG@10 on 7/14 BEIR datasets! 🤯
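A minimal sketch of this leave-one-out protocol, assuming hypothetical helpers `fine_tune_e5` and `evaluate_beir` (not the paper's released code):

```python
# Leave-one-out analysis: fine-tune E5 (base) with one training dataset held out
# at a time and compare average BEIR nDCG@10 against training on everything.
# fine_tune_e5 and evaluate_beir are hypothetical helpers, not the paper's actual code.

def leave_one_out_analysis(datasets, fine_tune_e5, evaluate_beir):
    baseline = evaluate_beir(fine_tune_e5(datasets))    # model trained on all datasets
    deltas = {}
    for held_out in datasets:
        rest = [d for d in datasets if d != held_out]   # drop exactly one dataset
        score = evaluate_beir(fine_tune_e5(rest))       # avg. nDCG@10 over BEIR
        deltas[held_out] = score - baseline             # > 0: removing this dataset helped
    return deltas
```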
Dataset Pruning
1️⃣ We prune 8/15 training datasets, leaving 7 datasets and reducing the training pairs by 2.35× (1.6M → 680K pairs).
2️⃣ E5 (base) fine-tuned on the 7 remaining datasets outperforms the model fine-tuned on all 15 datasets by 1.0 nDCG@10 on BEIR.
3️⃣ This shows that some training datasets are actively harmful to model performance.
False Negatives
In the pruned training datasets, we observe a common issue of "false negatives": hard negatives that are actually relevant to the query but labeled as irrelevant! We propose an LLM-judge cascading framework (RLHN) to identify and relabel these false negatives in training datasets.
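A minimal sketch of one way such a cascade could look, assuming the OpenAI Python client, an illustrative relevance prompt, and a cheaper-judge-then-stronger-judge ordering; the paper's actual prompts and cascade configuration may differ:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative prompt; the paper's actual judge prompts may differ.
JUDGE_PROMPT = (
    "Query: {query}\n\nPassage: {passage}\n\n"
    "Is the passage relevant to the query? Answer 'yes' or 'no'."
)

def judge(model: str, query: str, passage: str) -> bool:
    """Ask one LLM judge whether a mined hard negative is actually relevant."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(query=query, passage=passage)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

def is_false_negative(query: str, hard_negative: str) -> bool:
    # Stage 1: the cheaper judge screens every (query, hard-negative) pair.
    if not judge("gpt-4o-mini", query, hard_negative):
        return False
    # Stage 2: only pairs flagged as relevant are escalated to the stronger judge,
    # which makes the final call (GPT-4o agrees more closely with human annotators).
    return judge("gpt-4o", query, hard_negative)
```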
We carefully compare three ways of handling the identified false negatives in training pairs (a minimal sketch of all three follows the list):
1️⃣ Remove: Discard the entire training pair that contains a false negative.
2️⃣ HN Remove: Discard only the false negatives from the list of hard negatives.
3️⃣ RLHN: Relabel the false negatives as positives, while keeping the remaining hard negatives.
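A minimal sketch of the three operations on a single training example, with illustrative field names (`query`, `positives`, `hard_negatives`) and `false_negs` standing for the hard negatives flagged by the LLM cascade:

```python
def remove(example, false_negs):
    """Remove: drop the whole training pair if it contains any false negative."""
    return None if false_negs else example

def hn_remove(example, false_negs):
    """HN Remove: keep the pair but drop false negatives from the hard-negative list."""
    example = dict(example)
    example["hard_negatives"] = [p for p in example["hard_negatives"] if p not in false_negs]
    return example

def rlhn(example, false_negs):
    """RLHN: relabel false negatives as positives; keep the remaining hard negatives."""
    example = dict(example)
    example["positives"] = example["positives"] + list(false_negs)
    example["hard_negatives"] = [p for p in example["hard_negatives"] if p not in false_negs]
    return example
```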
Experimental Results
RLHN yields the largest improvement for both retrievers and rerankers compared to the other approaches. RLHN shows consistent gains even when we relabel only a small subset of training pairs: out-of-domain nDCG@10 on BEIR (Avg. 7) and AIR-Bench (Avg. 5) both improve steadily as more of the data is cleaned.
We also qualitatively analyze the different categories of identified false negatives; e.g., an ambiguous query can cause many of its hard negatives to actually be relevant.
Paper: https://arxiv.org/abs/2505.16967
Code: https://github.com/castorini/rlhn
Data: https://huggingface.co/rlhn
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG (2025)
- FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents (2025)
- CRAFT: Training-Free Cascaded Retrieval for Tabular QA (2025)
- Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance (2025)
- Augmented Relevance Datasets with Fine-Tuned Small LLMs (2025)
- ReasonIR: Training Retrievers for Reasoning Tasks (2025)
- Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval (2025)