Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
Abstract
Using cascading LLM prompts to identify and relabel false negatives in datasets improves retrieval and reranking models' performance.
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from various sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where judgments by GPT-4o show much higher agreement with human annotators than those of GPT-4o-mini.
Community
Relabeling datasets for Information Retrieval improves NDCG@10 of both embedding models & cross-encoder rerankers. This was already the prevalent belief, but now it's been confirmed. Great job @nthakur, @crystina-z, @MrLight & @lintool
See the organization with datasets & models here: https://huggingface.co/rlhn
- Tom Aarsen
Did you know that fine-tuning retrievers & re-rankers on large but unclean training datasets can harm their performance? 💡
In our new preprint, we reexamine the quality of popular IR training data by pruning datasets and identifying and relabeling false negatives!
Preprint: https://arxiv.org/abs/2505.16967
Preliminary
We fine-tune E5 (base) on 16 retrieval datasets from the BGE collection (1.6M training pairs) and conduct a leave-one-out analysis: leaving one dataset out and fine-tuning on the rest. Surprisingly, removing ELI5 alone improves nDCG@10 on 7/14 BEIR datasets! 🤯
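A minimal sketch of this leave-one-out protocol, assuming hypothetical helpers `fine_tune_e5` and `evaluate_beir` (not the paper's released code):

```python
# Leave-one-out analysis: fine-tune E5 (base) with one training dataset held out
# at a time and compare average BEIR nDCG@10 against training on everything.
# fine_tune_e5 and evaluate_beir are hypothetical helpers, not the paper's actual code.

def leave_one_out_analysis(datasets, fine_tune_e5, evaluate_beir):
    baseline = evaluate_beir(fine_tune_e5(datasets))    # model trained on all datasets
    deltas = {}
    for held_out in datasets:
        rest = [d for d in datasets if d != held_out]   # drop exactly one dataset
        score = evaluate_beir(fine_tune_e5(rest))       # avg. nDCG@10 over BEIR
        deltas[held_out] = score - baseline             # > 0: removing this dataset helped
    return deltas
```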
Dataset Pruning
1️⃣ We prune 8/15 training datasets, leaving 7 datasets and reducing the training pairs by 2.35× (1.6M → 680K pairs).
2️⃣ E5 (base) fine-tuned on the 7 remaining datasets outperforms the model fine-tuned on all 15 datasets by 1.0 nDCG@10 on BEIR.
3️⃣ This shows that some training datasets are actively harmful to model performance.
False Negatives
In the pruned training datasets, we observe a common issue of "false negatives": hard negatives that are actually relevant to the query but labeled as irrelevant! We propose an LLM-judge cascading framework (RLHN) to identify and relabel these false negatives in training datasets.
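A minimal sketch of one way such a cascade could look, assuming the OpenAI Python client, an illustrative relevance prompt, and a cheaper-judge-then-stronger-judge ordering; the paper's actual prompts and cascade configuration may differ:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative prompt; the paper's actual judge prompts may differ.
JUDGE_PROMPT = (
    "Query: {query}\n\nPassage: {passage}\n\n"
    "Is the passage relevant to the query? Answer 'yes' or 'no'."
)

def judge(model: str, query: str, passage: str) -> bool:
    """Ask one LLM judge whether a mined hard negative is actually relevant."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(query=query, passage=passage)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

def is_false_negative(query: str, hard_negative: str) -> bool:
    # Stage 1: the cheaper judge screens every (query, hard-negative) pair.
    if not judge("gpt-4o-mini", query, hard_negative):
        return False
    # Stage 2: only pairs flagged as relevant are escalated to the stronger judge,
    # which makes the final call (GPT-4o agrees more closely with human annotators).
    return judge("gpt-4o", query, hard_negative)
```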
We carefully compare three ways of handling the identified false negatives in training pairs (a minimal sketch of all three follows the list):
1️⃣ Remove: Discard the entire training pair that contains a false negative.
2️⃣ HN Remove: Discard only the false negatives from the list of hard negatives.
3️⃣ RLHN: Relabel the false negatives as positives, while keeping the remaining hard negatives.
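A minimal sketch of the three operations on a single training example, with illustrative field names (`query`, `positives`, `hard_negatives`) and `false_negs` standing for the hard negatives flagged by the LLM cascade:

```python
def remove(example, false_negs):
    """Remove: drop the whole training pair if it contains any false negative."""
    return None if false_negs else example

def hn_remove(example, false_negs):
    """HN Remove: keep the pair but drop false negatives from the hard-negative list."""
    example = dict(example)
    example["hard_negatives"] = [p for p in example["hard_negatives"] if p not in false_negs]
    return example

def rlhn(example, false_negs):
    """RLHN: relabel false negatives as positives; keep the remaining hard negatives."""
    example = dict(example)
    example["positives"] = example["positives"] + list(false_negs)
    example["hard_negatives"] = [p for p in example["hard_negatives"] if p not in false_negs]
    return example
```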
Experimental Results
RLHN yields the largest improvement for both retrievers and rerankers compared to the other approaches. RLHN shows consistent gains even when we relabel only a small subset of training pairs: out-of-domain nDCG@10 on BEIR (Avg. 7) and AIR-Bench (Avg. 5) both improve steadily as more of the data is cleaned.
We also qualitatively analyze the different categories of identified false negatives; e.g., an ambiguous query can cause many of its hard negatives to actually be relevant.
Paper: https://arxiv.org/abs/2505.16967
Code: https://github.com/castorini/rlhn
Data: https://huggingface.co/rlhn
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG (2025)
- FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents (2025)
- CRAFT: Training-Free Cascaded Retrieval for Tabular QA (2025)
- Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance (2025)
- Augmented Relevance Datasets with Fine-Tuned Small LLMs (2025)
- ReasonIR: Training Retrievers for Reasoning Tasks (2025)
- Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval (2025)