Abstract
BlockRank optimizes in-context ranking by enforcing inter-document block sparsity and enhancing query-document relevance, improving efficiency and scalability in large-scale information retrieval.
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR) that leverages the contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt, tasking the LLM to identify the relevant document(s). While effective, efficiency is a significant challenge in this paradigm, especially as the candidate list grows, due to the quadratic (or super-linear) scaling of the attention operation with context length. To address this, this paper first identifies inherent and exploitable structures in the attention of LLMs fine-tuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in the middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for the truly relevant documents during fine-tuning with an auxiliary contrastive training objective, improving retrieval directly in attention. Experiments on BEIR, MSMarco, and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7x faster for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists of around 500 documents in-context (approximately 100K tokens) within a second, presenting a scalable and effective solution for ICR.
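To make the two mechanisms above concrete, here is a minimal PyTorch sketch of (a) a block-sparse attention mask and (b) a contrastive loss on per-document pooled attention scores. All names and conventions here (`build_block_sparse_mask`, `attention_relevance_loss`, the segment-id layout) are illustrative assumptions for exposition, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def build_block_sparse_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (L, L) attention mask enforcing inter-document block sparsity.

    segment_ids: (L,) ints; 0 = shared instruction prefix, 1..K = candidate
    document blocks, -1 = query tokens (placed last in the prompt).
    Document tokens attend causally within their own block and to the shared
    prefix; query tokens attend causally to everything. Since no document
    block attends to any other, total attention cost grows linearly with the
    number of candidates instead of quadratically.
    """
    L = segment_ids.shape[0]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    same_block = segment_ids[:, None] == segment_ids[None, :]
    to_prefix = (segment_ids == 0)[None, :].expand(L, -1)
    from_query = (segment_ids == -1)[:, None].expand(-1, L)
    return causal & (same_block | to_prefix | from_query)

def attention_relevance_loss(attn_logits: torch.Tensor,
                             segment_ids: torch.Tensor,
                             relevant_doc: int,
                             num_docs: int) -> torch.Tensor:
    """Auxiliary contrastive objective on mid-layer attention.

    attn_logits: (L,) pre-softmax attention scores from one designated query
    token at a chosen middle layer. Scores are pooled into one logit per
    document block; cross-entropy then pushes attention mass toward the
    ground-truth relevant document (1-indexed `relevant_doc`).
    """
    block_logits = torch.stack([
        torch.logsumexp(attn_logits[segment_ids == d], dim=0)
        for d in range(1, num_docs + 1)
    ])
    target = torch.tensor([relevant_doc - 1])
    return F.cross_entropy(block_logits[None, :], target)

# Toy usage: 3 instruction tokens, two 4-token documents, 2 query tokens.
seg = torch.tensor([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, -1, -1])
mask = build_block_sparse_mask(seg)                  # (13, 13) boolean mask
aux = attention_relevance_loss(torch.randn(13), seg,
                               relevant_doc=1, num_docs=2)
```

Presumably this auxiliary loss would be added to the standard fine-tuning objective; the abstract's "retrieval in attention" suggests the same pooled per-document attention scores can then be read out at inference time as relevance scores.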
Community
We present "Scalable In-context Ranking with Generative Models," a step toward retrieval-native LLMs: models that understand and optimize retrieval internally, rather than treating it as an external prompt-level task.
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Contrastive Retrieval Heads Improve Attention-Based Re-Ranking (2025)
- Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision (2025)
- Upcycling Candidate Tokens of Large Language Models for Query Expansion (2025)
- Multi-view-guided Passage Reranking with Large Language Models (2025)
- From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation (2025)
- How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models (2025)
- Training LLMs to be Better Text Embedders through Bidirectional Reconstruction (2025)
Hi! First off, thank you for your excellent work—it’s been really helpful for our research. Could you please let us know if there’s a timeline for releasing the code and model weights? We’d greatly appreciate an update whenever you have a chance!