The Past and Present of Sparse Retrieval

Community Article Published October 4, 2025

Sparse retrieval has quietly powered search for decades, and it’s having a renaissance alongside dense embeddings. This article walks through where sparse methods came from, how they evolved, and why they still matter.

1. TF-IDF, BM25, Indexing

1-1. TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) assigns importance to words using their Term Frequency (TF) and Inverse Document Frequency (IDF). Its components can be summarized as follows:

  • TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
  • DF(t) = number of documents that contain term t
  • IDF(t, D) = ln( total number of documents / (1 + DF(t)) ) = ln( |D| / (1 + DF(t)) )
  • TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

In other words, a term that appears frequently in a particular document but not across the entire collection will receive a high TF-IDF score. For example, “Australopithecus” would likely get a high score, while demonstratives like “this” and “that” or other common function words would receive low scores because they tend to occur in many documents.

1-2. BM25

BM25 is a refinement of TF-IDF. As you can see with TF-IDF, the term frequency tends to grow as a document gets longer, which means very long documents can receive disproportionately large scores. BM25 corrects for this by adjusting the score using both the document length and the average document length, so long documents don’t get unfairly high scores.

Another issue with plain TF-IDF is that a term’s contribution increases linearly with each repetition in a document (e.g. “APT APT APT APT …”). BM25 counteracts this by saturating the term-frequency effect—beyond a certain point, repeating a word adds little additional importance (a smoothing effect).

Incorporating these two ideas, the BM25 scoring function is:

$$ \text{BM25}(D, Q) = \sum_{q_i \in Q} \operatorname{IDF}(q_i)\cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)} $$

  • The left factor is BM25’s IDF, and the right fraction is BM25’s TF component.

Looking at the right-hand side first: f(q_i, D) is the usual term frequency of q_i in document D. BM25 introduces a hyperparameter k_1, which controls how much extra credit a term gets for appearing one more time. In TF-IDF, the TF component grows linearly with f(q_i, D). In BM25, even if f(q_i, D) is large, a small k_1 dampens the gain from repetition, preventing runaway growth.

For the IDF term, BM25 commonly uses:

$$ \operatorname{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right) $$

where N is the total number of documents and n(q_i) is the number of documents containing q_i. Compared to the TF-IDF variant you may have seen, this form more aggressively downweights frequent terms and upweights rare terms. When n(q_i) is small (the term appears in few documents), the numerator gets larger and the denominator smaller, so IDF increases. Conversely, when n(q_i) is large, IDF decreases. (The +0.5 offsets guard against zero counts.)

A quick numeric check (ignoring the length-normalization term to isolate k_1’s effect). If k_1 = 0.5:

  • For f(q_i, D) = 10: (10 × (0.5 + 1)) / (10 + 0.5) = 15 / 10.5 ≈ 1.43
  • For f(q_i, D) = 1000: (1000 × (0.5 + 1)) / (1000 + 0.5) = 1500 / 1000.5 ≈ 1.50

In TF-IDF those two term frequencies would differ by 100×, but in BM25 their contributions are only about 5% apart (1.43 vs. 1.50), showing how BM25 curbs the effect of raw repetition.

Finally, in the TF denominator you’ll notice the factor

$$ 1 - b + b\,\frac{|D|}{\mathrm{avgdl}} $$

This is the length normalization term. If |D| > avgdl, that factor grows, the denominator increases, and the overall BM25 contribution shrinks. This prevents long documents from scoring higher merely because they are long.
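
To make the scoring function concrete, here is a minimal BM25 scorer in Python. It is a sketch rather than a production implementation: tokenization is assumed to have already happened, and k1 = 1.5, b = 0.75 are conventional defaults, not values prescribed by the text.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avgdl, k1=1.5, b=0.75):
    """Score one document against a query with BM25.

    query_terms : list of query tokens
    doc_terms   : list of document tokens
    doc_freqs   : dict mapping term -> number of documents containing it, i.e. n(q_i)
    num_docs    : total number of documents N
    avgdl       : average document length in the collection
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # term does not occur in this document
        n = doc_freqs.get(term, 0)
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))   # BM25 IDF with +0.5 smoothing
        norm = k1 * (1 - b + b * doc_len / avgdl)          # length normalization
        score += idf * (f * (k1 + 1)) / (f + norm)         # saturated TF component
    return score
```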

1-3. Indexing

BM25 is a ranking function (a scoring method) for evaluating a document’s relevance to a query. In practice, search begins with an indexing phase: all documents are tokenized, corpus statistics are collected, and an inverted index is built.

An inverted index is a data structure designed for fast, efficient search. Instead of asking “which words appear in this document?”, it lets us ask “which documents contain this word?” quickly. Concretely, a forward index uses doc_id as the key and stores {term: TF} as the value. An inverted index flips that: it uses the term as the key and stores {doc_id: TF} as the value.

For example, given the following toy corpus:

| Doc ID | Content               |
|--------|-----------------------|
| doc1   | "apple banana cherry" |
| doc2   | "banana cherry date"  |
| doc3   | "cherry date apple"   |

  1. Forward Index (per document, store term counts)
  • doc1: {"apple": 1, "banana": 1, "cherry": 1}
  • doc2: {"banana": 1, "cherry": 1, "date": 1}
  • doc3: {"cherry": 1, "date": 1, "apple": 1}
  2. Inverted Index (per term, store postings of documents and counts)
  • apple: {"doc1": 1, "doc3": 1}
  • banana: {"doc1": 1, "doc2": 1}
  • cherry: {"doc1": 1, "doc2": 1, "doc3": 1}
  • date: {"doc2": 1, "doc3": 1}

With this inverted index, a search for the term "banana" immediately tells us which documents contain it and how many times.

After the inverted index is built, at query time the query is tokenized. For each query token, we use the postings in the inverted index to compute the BM25 score between the query and each candidate document, and then return the top-scoring documents.
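
Putting the pieces together on the toy corpus above, the sketch below walks through both phases: building the inverted index, then answering a query with BM25. Whitespace tokenization is an assumption, and it reuses the bm25_score helper from the sketch in section 1-2.

```python
from collections import Counter, defaultdict

docs = {
    "doc1": "apple banana cherry",
    "doc2": "banana cherry date",
    "doc3": "cherry date apple",
}

# Indexing phase: tokenize, collect corpus statistics, build the inverted index.
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
inverted_index = defaultdict(dict)  # term -> {doc_id: term frequency}
for doc_id, terms in tokenized.items():
    for term, freq in Counter(terms).items():
        inverted_index[term][doc_id] = freq

doc_freqs = {term: len(postings) for term, postings in inverted_index.items()}
avgdl = sum(len(terms) for terms in tokenized.values()) / len(tokenized)

# Query phase: gather candidate documents from the postings, then score them
# with the bm25_score helper defined in the earlier sketch.
query = "banana cherry".split()
candidates = {doc_id for term in query for doc_id in inverted_index.get(term, {})}
ranked = sorted(
    ((doc_id, bm25_score(query, tokenized[doc_id], doc_freqs, len(docs), avgdl))
     for doc_id in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # top-scoring documents first
```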

2. Doc2Query

BM25 has a critical weakness: the vocabulary mismatch problem. Put simply, if you search for “spaghetti”, BM25 won’t naturally bring back documents that only say “pasta”.

To address this, sparse retrieval has explored many expansion techniques. Among them, Doc2Query is the most straightforward way to expand documents (proposed in a 2019 paper).

2-1. Doc2Query Framework

As the name suggests, the idea is to use a generative model to create multiple plausible queries from each document.

[Figure: the Doc2Query framework, from the paper]

In the original paper, they trained a Transformer from scratch on passage–query pairs: given a passage as input, the model generates a corresponding query. (Back in 2019 there weren’t strong off-the-shelf generative models, so a vanilla Transformer was used.)

After training, they feed each document to the model and generate 10 queries via top-k random sampling. These generated queries are then appended to the original document text (no special delimiter or token to separate “document” from “query,” just a literal string concatenation). The expanded documents are re-indexed, and retrieval is performed with BM25 for evaluation.
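
The same recipe is easy to reproduce with today's tooling. The sketch below does not train a Transformer from scratch as the paper did; it assumes an off-the-shelf passage-to-query seq2seq checkpoint (the model name is an assumption, substitute whichever doc2query-style model you use) and mirrors the procedure: sample 10 queries with top-k sampling, then append them to the document text before re-indexing.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumption: any passage->query seq2seq checkpoint in the spirit of Doc2Query.
MODEL_NAME = "castorini/doc2query-t5-base-msmarco"  # illustrative choice, replace as needed
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def expand_document(doc_text: str, num_queries: int = 10) -> str:
    """Generate queries via top-k sampling and append them to the document."""
    inputs = tokenizer(doc_text, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,          # top-k random sampling, as in the paper
        top_k=10,
        num_return_sequences=num_queries,
    )
    queries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return doc_text + " " + " ".join(queries)  # plain concatenation, no delimiter
```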

2-2. Evaluation

[Figure: Doc2Query evaluation results on MS MARCO and TREC-CAR, from the paper]

At the time, the common setup was BM25 for first-stage retrieval and BERT for re-ranking. In the paper’s tables, the top rows report re-ranker results, the middle rows show first-stage retrieval only, and the bottom rows combine first-stage retrieval + re-ranking. (The first two re-rankers were state of the art then; RM3 is a classic query expansion method that generates candidate terms from the query side.)

On MS MARCO and TREC-CAR, Doc2Query achieved SOTA in first-stage retrieval at the time.

2-3. Qualitative Analysis

[Figure: example of queries generated by Doc2Query, from the paper]

Qualitatively, the generative model often reuses terms from the input document when forming queries, effectively re-weighting those terms (highlighting what matters). It also introduces new terms not present in the document, directly tackling the vocabulary mismatch. In fact, the ratio of “copied terms : newly generated terms” was reported to be 69 : 31.

3. DeepCT

Despite these improvements, BM25 and Doc2Query were still criticized for not adequately capturing contextual meaning. In other words, we need to pay closer attention to term importance.

[Figure: DeepCT's motivating example of the term “stomach” in two passages, from the paper]

For example, the word “stomach” may appear in two different passages, but in one passage it is central to the topic, while in the other it is only mentioned in passing. DeepCT assumes that many queries arise from a document’s key idea and focuses on identifying which terms reflect the central meaning. Its goals are:

  1. Find important passage terms for passage retrieval (DeepCT),
  2. Find important terms in queries (DeepCT-Query).

3-1. DeepCT Framework

DeepCT uses BERT to extract a word’s contextual meaning and compute a term weight. Simply, it predicts each token’s importance from context, using the following equation:

$$ \hat{y}_{t,c} = w^{\top} T_{t,c} + b $$

T_{t,c} is the embedding of term t within text c produced by BERT, and w and b are the weights and bias of a linear layer. Thus, \hat{y}_{t,c} becomes the predicted word-importance score. The model is trained by minimizing the MSE loss between the predicted value and each token’s ground-truth (GT) term weight.

How do we define a token’s GT importance score (GT term weight) in a passage? The authors borrow Query Term Recall (QTR) from prior work. The formula is:

$$ \mathrm{QTR}(t, d) = \frac{|Q_{d,t}|}{|Q_d|} $$

This represents “the proportion of queries (among those related to passage d) that contain term t.” The assumption is that queries generally reflect a document’s key idea, since people tend to ask about a document’s central concepts. Terms that appear in queries related to a document are therefore treated as more important than terms that do not.

With the GT defined, training proceeds by minimizing the MSE loss between the predicted term weights and the GT term weights:

$$ \mathcal{L}_{\mathrm{MSE}}(c) = \frac{1}{|c|} \sum_{t \in c}\left(\hat{y}_{t,c} - y_{t,c}\right)^2 $$

Once training is complete, the model can run inference on a single passage and output the weight for each term within that passage.

Before retrieval, the DeepCT model is run over all passages; the predicted weights are then scaled into an integer range for indexing.

$$ \mathrm{TF}_{\text{DeepCT}}(t, d) = \operatorname{round}\bigl(N \cdot \hat{y}_{t,d}\bigr) \qquad (N = 100) $$

These integerized values are denoted TF_DeepCT(t, d), and they are inserted into the inverted index before search. In other words, when (re)indexing, the importance score of each term in a specific document is what gets recorded.

Example of how the index changes:

  • Original inverted index: "DNA": {"doc_1": 3, "doc_5": 2}
  • DeepCT inverted index: "DNA": {"doc_1": 9, "doc_5": 7} (Now the importance of "DNA" in each document is indexed)
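
A minimal sketch of the DeepCT recipe under the assumptions above: BERT hidden states feed a one-dimensional linear head, training minimizes MSE against per-token QTR targets, and at indexing time the predictions are scaled and rounded into integer pseudo term frequencies. The placeholder targets and the clamp to non-negative values are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DeepCT(nn.Module):
    """BERT encoder + linear head predicting one importance score per token."""
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.regressor = nn.Linear(self.encoder.config.hidden_size, 1)  # w^T T_{t,c} + b

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.regressor(hidden).squeeze(-1)  # [batch, seq_len] importance scores

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DeepCT()

# Training step: MSE between predicted weights and per-token QTR ground truth.
batch = tokenizer(["stomach pain causes and treatment"], return_tensors="pt")
qtr_targets = torch.zeros(batch["input_ids"].shape, dtype=torch.float)  # placeholder GT weights
pred = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.mse_loss(pred, qtr_targets)

# Indexing step: TF_DeepCT(t, d) = round(N * y_hat), with N = 100.
with torch.no_grad():
    weights = model(batch["input_ids"], batch["attention_mask"]).clamp(min=0)  # clamp is an assumption
    tf_deepct = torch.round(100 * weights).int()
```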

3-2. DeepCT-Query

Going a step further, the authors argue we should also capture important terms from the query side. To do this, they similarly extract term weights from queries.

A query term’s GT importance score is defined using Term Recall (TR). The formula is:

$$ \mathrm{TR}(t, q) = \frac{|D_{q,t}|}{|D_q|} $$

As before, D_q denotes the set of documents relevant to query q, and D_{q,t} is the subset of those documents that contain term t. The assumption is that a query term that appears in a larger fraction of its relevant documents is more important.

They prepare query–document pairs and train to minimize the MSE loss between predictions and GT. At inference time, given a query, the model outputs weights for each query token.

When a query arrives at retrieval time, the model produces importance scores (weights) for each query term. Those weights are then used to form BoW queries and SDM (Sequential Dependence Model) queries, which additionally account for bigrams and within-window co-occurrences.

For example, when the query is “apple pie”:

  • term weights: (0.8 apple) (0.7 pie)
  • bigram-based expansion = when “apple” and “pie” appear consecutively
    (0.56 apple pie)
  • windowed co-occurrence = when “apple” and “pie” appear within a specific window

After passing through the DeepCT model and SDM, the final weighted query becomes something like: #weight (0.8 apple 0.7 pie 0.5 #1(apple pie) 0.3 #uw8(apple pie)) (example with window size 8). Terms with negative weights are removed.

Thus, we obtain two variants: BoW-DeepCT-Query (using only term weights) and SDM-DeepCT-Query (after SDM expansion).
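
To make the query-side output tangible, here is a small sketch that assembles an Indri-style weighted query string from predicted term weights, following the “apple pie” example above. How exactly the bigram and window weights are derived from the unigram weights is an assumption here; the paper's SDM weighting details are not reproduced.

```python
def build_sdm_deepct_query(term_weights, bigram_weight=0.5, window_weight=0.3, window=8):
    """Form an Indri-style weighted query from (term, weight) pairs.

    term_weights : list of (term, weight); terms with non-positive weights are dropped.
    """
    kept = [(t, w) for t, w in term_weights if w > 0]
    parts = [f"{w:.2f} {t}" for t, w in kept]                         # unigram weights
    for (t1, _), (t2, _) in zip(kept, kept[1:]):                      # adjacent term pairs
        parts.append(f"{bigram_weight:.2f} #1({t1} {t2})")            # exact bigram match
        parts.append(f"{window_weight:.2f} #uw{window}({t1} {t2})")   # unordered window match
    return "#weight(" + " ".join(parts) + ")"

print(build_sdm_deepct_query([("apple", 0.8), ("pie", 0.7)]))
# -> #weight(0.80 apple 0.70 pie 0.50 #1(apple pie) 0.30 #uw8(apple pie))
```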

3-3. Experiment Setup

Baselines:

  • TF index (plain TF-based index)
  • TextRank (unsupervised graph-based term weighting)
  • Doc2Query
  • DeepCT-Index
  • DeepCT_W-Index (DeepCT with context-independent word embeddings instead of BERT)
  • DeepCT_E-Index (DeepCT with ELMo instead of BERT)

Indexing and retrieval:

  • First-stage ranking with BM25 and QL (query likelihood).
  • They also tested re-ranking with Conv-KNRM and a BERT re-ranker; here we focus on first-stage retrieval.

Metrics:

  • MRR@10 on MS MARCO
  • MAP@1000 on TREC-CAR

3-4. Results

[Figure: DeepCT first-stage retrieval results, from the paper]

  1. DeepCT-Index delivers the best first-stage retrieval performance, surpassing Doc2Query even though both are trained on similar data.
  2. Replacing BERT with context-independent embeddings (DeepCT_W-Index) or ELMo (DeepCT_E-Index) leads to lower performance, highlighting the effectiveness of BERT.

[Figure: DeepCT-Query results, from the paper]

  1. On the query side, using DeepCT-Query with SDM-based expansions yields the strongest average results among query-side variants.

3-5. Qualitative Analysis

This paper also conducts qualitative analysis comparing documents relevant and irrelevant to a given query.

[Figure: DeepCT term weights for a relevant vs. an irrelevant document (“susan boyle” example), from the paper]

As the figure shows, for a query about “susan boyle,” the relevant document (an article introducing Susan Boyle) assigns higher weights to the corresponding terms, whereas an unrelated document that merely mentions her assigns lower weights.

They also present the weight distribution by averaging the highest-weighted term from each passage in each dataset (e.g., take the top 10 terms per passage and plot their weights).

[Figure: term-weight distributions of the original TF index vs. DeepCT-Index, from the paper]

Compared to the relatively flat distribution of the original TF index, DeepCT-Index clearly emphasizes important words. In other words, DeepCT-Index highlights a small set of central terms and suppresses others.

As the authors note, a limitation of this approach is the assumption that “queries always derive from a document’s central idea.” If a query instead targets peripheral details (e.g., exam questions that hinge on minor points), DeepCT’s method could backfire.

4. SparTerm

DeepCT does not fully resolve BM25’s “vocabulary mismatch” limitation. To increase semantic-level matching, SparTerm proposes an approach that explicitly expands terms.

4-1. SparTerm Architecture

SparTerm uses the following architecture for term expansion.

[Figure: the SparTerm architecture, from the paper]

The SparTerm architecture consists of two modules: an Importance Predictor and a Gating Controller. The final representation is the element-wise product of the two:

$$ p' = \mathcal{F}(p) \;\odot\; \mathcal{G}(p) $$

  • F is the importance predictor, which produces a dense importance distribution over the vocabulary.
  • G is the gating controller, which produces a sparse binary gating vector.

4-1-1. Importance Predictor

[Figure: the Importance Predictor, from the paper]

Given an input passage p, the importance predictor outputs the semantic importance over the vocabulary for each term in p. As shown above, it first extracts token-level hidden states from a PLM, then feeds each hidden state into a token-wise importance predictor to obtain a token-wise importance distribution. The dense importance distribution over all vocabulary terms for the i-th token is:

$$ I_i = \mathrm{Transform}(h_i)\,E^{\top} + b $$

  • h_i: the token’s embedding (hidden state)
  • Transform: a linear transformation with GELU activation and layer normalization
  • E: the shared word-embedding matrix; b: a bias term

This mirrors the BERT MLM head: h_i is the final-layer hidden state for the i-th input token, and the token-wise importance predictor plays the role of an MLM head to produce a distribution over the entire BERT vocabulary for that token.

SparTerm then aggregates token-level distributions by applying ReLU and summing:

$$ I = \sum_{i=0}^{L} \mathrm{ReLU}\left(I_i\right) $$

In short, this yields a vocabulary distribution that reflects the full passage context.
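
Since the token-wise importance predictor plays the role of a BERT MLM head, the whole importance predictor can be sketched with an off-the-shelf masked-LM model. The checkpoint choice and the inclusion of special tokens in the sum are assumptions of this sketch.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # MLM head = Transform(h_i) E^T + b

def importance_distribution(text: str) -> torch.Tensor:
    """Return a |V|-dimensional importance vector I for one passage."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = mlm(**batch).logits                    # [1, L, |V|] token-wise distributions I_i
    mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding positions
    return (torch.relu(logits) * mask).sum(dim=1).squeeze(0)  # I = sum_i ReLU(I_i)

I = importance_distribution("hives are a common allergic reaction")
print(I.shape)  # torch.Size([30522]) for bert-base-uncased
```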

4-1-2. Gating Controller

[Figure: the Gating Controller, from the paper]

The Gating Controller outputs a binary gating signal deciding which terms to activate to represent the passage. Naturally, all terms present in the original passage should be activated; to mitigate lexical mismatch, it is also desirable to activate additional topic-related terms.

SparTerm therefore proposes two gating controllers, Literal-only gating and Expansion-enhanced gating:

  • Literal-only gating activates only the terms that appear in the original text (bag-of-words activation):

$$ \mathcal{G}(p) = \mathrm{BOW}(p) $$

  • Expansion-enhanced gating aims to reduce the lexical mismatch gap by activating additional, contextually relevant terms. The procedure:

    1. As with the importance predictor, obtain a passage-wise dense gating distribution G (logits over the full vocabulary for the passage).
    2. G' = Binarizer(G): set entries to 1 if their probability ≥ k (e.g., k = 0.7), otherwise 0. This keeps only high-probability terms given the passage context.
    3. G_e = G' ⊙ ¬BOW(p): keep only those high-probability terms not already present in the passage.
    4. Final expansion-enhanced gating vector: G_{le} = G_e + BOW(p), i.e., combine high-probability new terms with the original passage terms into a single sparse vector.
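
A sketch of the expansion-enhanced gating step, following the procedure above. It assumes the gating controller has already produced passage-wise logits over the vocabulary (e.g., from a second MLM-head model); the sigmoid parameterization and the threshold value are assumptions.

```python
import torch

def expansion_enhanced_gate(gating_logits: torch.Tensor,
                            input_ids: torch.Tensor,
                            vocab_size: int,
                            k: float = 0.7) -> torch.Tensor:
    """Combine high-probability expansion terms with the literal bag of words.

    gating_logits : [|V|] passage-wise gating logits from the gating controller
    input_ids     : 1-D tensor of token ids that actually occur in the passage
    """
    probs = torch.sigmoid(gating_logits)       # dense gating distribution G (assumed sigmoid)
    g_prime = (probs >= k).float()             # G' = Binarizer(G), keep high-probability terms
    bow = torch.zeros(vocab_size)
    bow[input_ids] = 1.0                       # BOW(p): terms literally present in the passage
    g_e = g_prime * (1.0 - bow)                # G_e = G' ⊙ ¬BOW(p): new terms only
    return torch.clamp(g_e + bow, max=1.0)     # G_le = G_e + BOW(p)

# usage sketch:
# gate = expansion_enhanced_gate(gating_logits, batch["input_ids"][0], vocab_size=30522)
# sparse_rep = importance_distribution(passage) * gate   # p' = F(p) ⊙ G(p)
```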

4-2. Training

In the SparTerm framework, the Importance Predictor and the Gating Controller are trained separately.

First, the Importance Predictor is trained with the following loss:

$$ L_{\mathrm{rank}}(q_i, p_{i,+}, p_{i,-}) = - \log \frac{e^{\,\mathrm{sim}(q_i',\,p_{i,+}')}}{e^{\,\mathrm{sim}(q_i',\,p_{i,+}')} + e^{\,\mathrm{sim}(q_i',\,p_{i,-}')}} $$

This is the familiar triplet-style contrastive loss used in dense retrieval, with one negative passage per query. Although one might question why a ranking loss is used to learn term distributions, the paper argues that learning importance from a distant supervisory signal tied to passage ranking is effective.

For the Gating Controller, the authors outline four scenarios where term expansion may occur:

[Figure: four term-expansion scenarios, from the paper]

Intuitively, synonymy and frequent co-occurrence can already be handled by MLM pretraining. Thus, the paper focuses on passage2query (predict queries that could be asked about the passage) and summarization (predict words likely to appear in a summary of the passage). They train only on passage–query and passage–summary data, using two cross-entropy terms:

$$ L_{\mathrm{exp}} = -\lambda_1 \sum_{j \in \{m \mid T_m = 0\}} \log\left(1 - G_j\right) \;-\; \lambda_2 \sum_{k \in \{m \mid T_m = 1\}} \log\left(G_k\right) $$

  • p: passage, t: target text (query or summary)
  • T: binary BoW vector of t
  • G: dense gating probability distribution for p

Explanation:

  1. For the first term with G_j: these are words absent from the target text (m | T_m = 0). To minimize the loss, G_j should decrease, pushing down logits for words not in the target (i.e., discouraging irrelevant expansions).
  2. For the second term with G_k: these are words present in the target text (m | T_m = 1). The loss encourages larger G_k, boosting logits for target-consistent expansions.
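
The two terms amount to a binary cross-entropy over the vocabulary between the dense gating distribution G and the target BoW vector T. A minimal sketch (the sigmoid parameterization and the λ weights are assumptions):

```python
import torch

def expansion_loss(gating_logits: torch.Tensor, target_bow: torch.Tensor,
                   lambda_1: float = 1.0, lambda_2: float = 1.0) -> torch.Tensor:
    """L_exp = -λ1 Σ_{T_m=0} log(1 - G_j) - λ2 Σ_{T_m=1} log(G_k), with G = sigmoid(logits)."""
    g = torch.sigmoid(gating_logits)
    neg = -(1 - target_bow) * torch.log(1 - g + 1e-8)  # push down terms absent from the target
    pos = -target_bow * torch.log(g + 1e-8)            # push up terms present in the target
    return lambda_1 * neg.sum() + lambda_2 * pos.sum()
```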

The final objective sums the two:

$$ L = L_{\mathrm{rank}} + L_{\mathrm{exp}} $$

Training details:

  • Both modules are initialized from BERT-base, with no weight sharing.
  • Training is staged: fine-tune the Gating Controller first on the MS MARCO passage-retrieval training set for 50k steps.
  • Then freeze the Gating Controller and fine-tune SparTerm with the triplet-style loss.

4-3. Results

[Table: SparTerm results, from the paper]

  • SparTerm (expansion-enhanced) achieves the best MRR and, for Recall, outperforms all baselines except Doc2Query-T5.
  • Since T5 outperforms the transformer used in Doc2Query, the authors suggest SparTerm could benefit from stronger PLMs than BERT as well.
  • Even without expansion, SparTerm (literal-only) outperforms DeepCT.

4-4. Qualitative Analysis

[Figure: DeepCT vs. SparTerm term weights (“hives” example), from the paper]

Both DeepCT and SparTerm assign high weights to the most important passage terms for a given query. However, DeepCT tends to focus its weight on fewer terms, while SparTerm captures a somewhat broader set. For example, for a query about “hives,” SparTerm assigns weight to “allergic reactions,” whereas DeepCT does not.

SparTerm also predicts “symptom” from “sign,” and “women” from “pregnancy,” demonstrating the effectiveness of term expansion.

[Figure: top-5 expanded terms per seed term, from the paper]

This distribution shows, for each seed term (listed below the chart), the top-5 expanded terms. The expansions are coherent and frequently co-occur with the seeds, supporting SparTerm’s strengths.

5. SPLADE

SPLADE, which came afterward, is remarkably simple. By adding just a log activation and a sparsity regularizer on top of SparTerm, it achieved much better performance.

SPLADE points out that SparTerm is unnecessarily complex and cannot be trained end-to-end. As noted earlier, SparTerm first trains the Gating Controller, freezes it, and then trains the entire framework again. Moreover, SparTerm’s results suggest that a literal-only variant already performs well, raising doubts about how much the expansion actually helps.

[Figure: SPLADE overview, from the paper]

5-1. SPLADE Methodology

5-1-1. Architecture

A single PLM (e.g., BERT-base) encodes both queries and passages. For each input token x_i with final hidden state h_i, an MLM-style head projects into the vocabulary space:

$$ z_i = \mathrm{Transform}(h_i)\,E^\top + b $$ where Transform = Linear → GELU → LayerNorm.

Here E is the shared word-embedding matrix (vocabulary size |V|). SPLADE converts the per-token logits z_i into a document-level (or query-level) vocabulary vector via log-saturated summation:

$$ w_j = \sum_{i=1}^{L} \log\bigl(1 + \max(0,\, z_{i,j})\bigr) $$

  • The ReLU ensures only positive evidence accumulates.
  • The log(1 + ·) caps runaway contributions, encouraging sparsity even without extra penalties.
  • The result w ∈ ℝ^{|V|} is sparse in practice (most entries ≈ 0) and lives directly in the lexical vocabulary, so we can build a standard inverted index.
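
A minimal sketch of SPLADE encoding and scoring with a masked-LM model from transformers. The checkpoint name is an assumption (any SPLADE-style or plain MLM checkpoint can stand in for the sketch):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "naver/splade-cocondenser-ensembledistil"  # assumed checkpoint, replace as needed
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

def splade_encode(text: str) -> torch.Tensor:
    """w_j = sum_i log(1 + ReLU(z_{i,j})) over the |V|-dimensional MLM logits."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits                   # [1, L, |V|] per-token logits z_i
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding positions
    w = torch.log1p(torch.relu(logits)) * mask           # log-saturated positive evidence
    return w.sum(dim=1).squeeze(0)                       # sparse-in-practice |V| vector

q = splade_encode("bow legs treatment")
d = splade_encode("bow legs in toddlers usually straighten without treatment")
score = torch.dot(q, d)                                  # s(q, d) = <w_q, w_d>
```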

5-1-2. Scoring

At retrieval time, we compute a simple dot product between the sparse query vector and the sparse document vector:

$$ s(q,d) = \langle w^{(q)},\, w^{(d)} \rangle $$

This is fast in inverted-index engines because only overlapping activated terms contribute.

5-1-3. InfoNCE Loss w/ in-batch negatives

The loss is also slightly changed: instead of a triplet loss over (query, positive, negative), SPLADE trains with an InfoNCE-style loss that keeps one sampled negative per query and adds in-batch negatives (the positive passages of the other queries in the batch).

$$ \mathcal{L}_{\text{rank-IBN}} = - \log \frac{e^{\,s(q_i, d_i^{+})}}{e^{\,s(q_i, d_i^{+})} + e^{\,s(q_i, d_i^{-})} + \sum_{j} e^{\,s(q_i, d_{i,j}^{-})}} $$
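
With in-batch negatives, the loss reduces to a cross-entropy over the query-document similarity matrix. The sketch below keeps only the in-batch negatives for brevity; the paper's formulation additionally retains one sampled hard negative per query.

```python
import torch
import torch.nn.functional as F

def infonce_in_batch(q_reps: torch.Tensor, d_reps: torch.Tensor) -> torch.Tensor:
    """q_reps, d_reps: [B, |V|] SPLADE vectors; d_reps[i] is the positive for q_reps[i],
    and every other document in the batch acts as a negative."""
    scores = q_reps @ d_reps.T                                   # [B, B] dot-product similarities
    labels = torch.arange(q_reps.size(0), device=scores.device)  # positives lie on the diagonal
    return F.cross_entropy(scores, labels)
```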

5-1-4. FLOPS Regularizer

A key element in SPLADE is the FLOPS regularizer. Efficiency is governed by how many term-wise multiplications we do per (query, document) pair, i.e., by how often terms are “activated.” SPLADE reports an empirical FLOPS metric (the expected number of term matches) to capture this.

Training surrogate (FLOPS regularizer). To learn efficient representations, SPLADE adds a penalty that discourages terms that are activated too often across a batch. Let

$$ a_j = \frac{1}{N}\sum_{i=1}^{N} w^{(d_i)}_j $$

where a_j is the average activation of term j over a batch of N documents. Then

$$ \ell_{\text{FLOPS}} = \sum_{j \in V} a_j^2 \, . $$

Squaring a_j hits stopword-like dimensions hard, pushing the model toward a balanced postings distribution (no single term active everywhere). This yields lower runtime cost than a plain ℓ1 penalty at comparable accuracy.

Two knobs that matter in practice.

  • Separate strengths for queries and docs: λ_q (usually stronger) and λ_d. Putting more pressure on queries reduces latency the most.
  • Warm-up / ramp: increase the sparsity weight(s) gradually (e.g., quadratic ramp to ~50k steps), so early learning isn’t dominated by sparsification.

Final objective (per batch):

$$ \mathcal{L} = \mathcal{L}_{\text{rank}} + \lambda_q\,\mathcal{L}^{(q)}_{\text{reg}} + \lambda_d\,\mathcal{L}^{(d)}_{\text{reg}} \quad \text{with } \mathcal{L}_{\text{reg}} \in \{\ell_{\text{FLOPS}},\,\ell_1\}. $$
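
A sketch of the FLOPS penalty and the combined objective. The λ values are placeholders, and the ramp schedule mentioned above is omitted; both would be tuned in practice.

```python
import torch

def flops_regularizer(reps: torch.Tensor) -> torch.Tensor:
    """reps: [B, |V|] batch of sparse representations; penalty = sum_j (mean_i w_ij)^2."""
    a = reps.mean(dim=0)       # average activation of each vocabulary term over the batch
    return (a ** 2).sum()      # an l1 alternative would be reps.abs().sum(dim=1).mean()

def splade_objective(rank_loss, q_reps, d_reps, lambda_q=3e-4, lambda_d=1e-4):
    # Placeholder lambdas; stronger pressure on the query side (lambda_q > lambda_d)
    # is what reduces query-time latency the most.
    return rank_loss + lambda_q * flops_regularizer(q_reps) + lambda_d * flops_regularizer(d_reps)
```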

5-2. Training details & Experimental setup

For training and evaluation, they use the MS MARCO passage-ranking dataset’s train and dev sets, and also evaluate on the TREC 2019 eval set.

The model is initialized from BERT-base and trained for 150k steps with a batch size of 124.
Before retrieval, each document is passed through the SPLADE model to obtain a sparse vector of vocabulary logits. At query time, the query is expanded in the same way; the final score is the dot product between the query’s and document’s sparse vectors.


5-3. Results

Baselines:

  • BM25
  • DeepCT
  • Doc2query-T5
  • SparTerm (both lexical-only and expansion)
  • ANCE (Dense)
  • TCT-ColBERT (Dense)
  • SparTerm-lexical-only (w/ InfoNCE)
  • SparTerm-exp (SparTerm with log saturation and trained with InfoNCE)
  • SPLADE

[Table: SPLADE results on MS MARCO dev and TREC DL 2019, from the paper]

  1. Except for TREC DL Recall@1000, SPLADE achieves the best performance across datasets and metrics.
  2. It is competitive even against the (then) SOTA dense retrievers.
  3. Training SparTerm lexical-only with InfoNCE outperforms the original SparTerm expansion model—highlighting the effect of InfoNCE vs. triplet loss.
  4. FLOPS efficiency: SPLADE with FLOPS regularization attains similar effectiveness while lowering compute (e.g., SPLADE-ℓ_FLOPS ≈ 0.73 vs. SPLADE-ℓ1 ≈ 0.88 FLOPS), and is far more efficient than SparTerm variants (≈2.8–4.6), approaching BM25-level cost (≈0.13).

5-4. Qualitative Analysis

[Figure: SPLADE qualitative example (“bow legs”), from the paper]

From the example above, after passing through the model, important terms are re-weighted while unnecessary ones are suppressed (likely those whose term weights become ≤ 0). We can also observe effective expansions such as legs → leg and bow legs → treatment.

6. SPLADE-v2

A natural question is whether we really need to sum logits over all tokens in the sequence when computing the token-importance score. Wouldn’t it suffice to use just the position where a given vocabulary term attains its highest logit? SPLADE-v2 changes the pooling strategy accordingly. In equations, what used to be

$$ w_j = \sum_{i \in t} \log\bigl(1 + \mathrm{ReLU}(w_{ij})\bigr) $$

becomes

$$ w_j = \max_{i \in t}\, \log\bigl(1 + \mathrm{ReLU}(w_{ij})\bigr) $$

Thus, SPLADE-v1 obtains a vocabulary weight by summing over all token positions, whereas SPLADE-v2 uses only the largest logit among positions—hence the name SPLADE-max.
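
In code, the switch from SPLADE to SPLADE-max is a one-line change of the pooling operator. The snippet below reuses the masked, log-saturated logits (logits, mask) from the SPLADE sketch in section 5-1-1.

```python
import torch

# logits: [1, L, |V|] MLM logits, mask: [1, L, 1] attention mask (from the SPLADE sketch)
sat = torch.log1p(torch.relu(logits)) * mask  # log-saturated, padding zeroed out

w_v1 = sat.sum(dim=1)   # SPLADE (v1): sum pooling over token positions
w_v2 = sat.amax(dim=1)  # SPLADE-max (v2): max pooling over token positions
```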

[Table: SPLADE-max results, from the paper]

Empirically, SPLADE-max (SPLADE-v2) yields better performance on all tasks and metrics.

7. Conclusion

Sparse retrieval has evolved from simple frequency statistics to neural models that learn which terms matter and how to activate them. TF-IDF and BM25 gave us a fast, interpretable baseline and the inverted index that still powers large-scale search. Doc2Query attacked vocabulary mismatch by expanding documents with likely queries. DeepCT shifted the focus to contextual importance, predicting which terms represent a passage or a query. SparTerm added explicit expansion via a learned gate, but paid a price in complexity and training stages. SPLADE simplified the recipe: keep the MLM head, add log-saturation, train with InfoNCE, and control efficiency with a FLOPS regularizer. SPLADE-v2’s max pooling pushed effectiveness further without sacrificing the sparse, index-friendly interface.

The picture that emerges is clear: sparse methods remain competitive because they combine lexical precision, transparency, and production efficiency. With the FLOPS regularizer, modern neural sparse models can approach BM25-like cost while delivering strong ranking quality. And since scores decompose over terms, debugging and policy controls stay straightforward, which matters in real systems.

Sparse retrieval is not a relic. It is a modern, controllable interface between language and search systems, and with careful regularization and training objectives, it scales while staying understandable.
