Vocabulary is the most important element of Sparse Retrieval
In sparse retrieval (especially SPLADE-like models), the vocabulary plays a much bigger role than simply being a list of words. It defines the model’s interaction space, because its size sets the dimensionality of the sparse representation itself.
While training Korean SPLADE models, I ran into issues that traced back to the vocabulary. After discussing this with the sentence-transformers maintainers and the OpenSearch community (see: https://github.com/UKPLab/sentence-transformers/issues/3431), it became clear that vocabularies are absolutely critical for learned sparse retrieval.
In this post, I'll be sharing my experiments from training SPLADE. If you’re new to sparse embedding models, I recommend reading my previous article: https://huggingface.co/blog/yjoonjang/the-past-and-present-of-sparse-retrieval.
How to choose the best backbone model for a sparse retriever
1. Experiment setup
1-1. Backbone models
I trained four backbone models on a local dataset of ~900k examples. The models are:
model | vocab size | tokenizer language |
---|---|---|
klue/roberta-base | 32000 | ko |
skt/A.X-Encoder-base | 50000 | ko, en |
Alibaba-NLP/gte-multilingual-base | 250048 | multilingual (70+) |
jhu-clsp/mmBERT-base | 256000 | multilingual (1800+) |
klue/roberta-base and skt/A.X-Encoder-base use tokenizers centered on Korean (the latter also covers English). Alibaba-NLP/gte-multilingual-base uses a multilingual tokenizer supporting 70+ languages. jhu-clsp/mmBERT-base is MLM-pretrained with the gemma2 tokenizer that targets 1800+ languages.
1-2. Model training
I used contrastive learning, which is a practical and efficient way to train sparse retrievers. All training was done with sentence-transformers.
- dataset: triplets of `<query, positive, hard_negative>`
- batch_size: 8
- max_len: 512
- query_regularizer_weight: 5e-5
- document_regularizer_weight: 3e-5
- bf16 mixed-precision training
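Putting these settings together, here is a minimal training sketch. It assumes sentence-transformers v5’s SparseEncoder API; the backbone name, output directory, and the tiny inline triplet dataset are placeholders standing in for the actual local data.

```python
from datasets import Dataset
from sentence_transformers import SparseEncoder, SparseEncoderTrainer, SparseEncoderTrainingArguments
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseMultipleNegativesRankingLoss
from sentence_transformers.sparse_encoder.models import MLMTransformer, SpladePooling

# Build a SPLADE-style sparse encoder from an MLM backbone (klue/roberta-base shown as one example).
mlm = MLMTransformer("klue/roberta-base", max_seq_length=512)
model = SparseEncoder(modules=[mlm, SpladePooling(pooling_strategy="max")])

# Placeholder <query, positive, hard_negative> triplets; the real run used ~900k of these.
train_dataset = Dataset.from_dict({
    "query": ["김치는 어떻게 만드나요?"],
    "positive": ["김치는 배추를 소금에 절인 뒤 양념을 버무려 만든다."],
    "negative": ["서울은 대한민국의 수도이다."],
})

# Contrastive objective plus SPLADE's sparsity regularization, with the weights listed above.
loss = SpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model),
    query_regularizer_weight=5e-5,
    document_regularizer_weight=3e-5,
)

args = SparseEncoderTrainingArguments(
    output_dir="output/splade-klue-roberta-base",
    per_device_train_batch_size=8,
    bf16=True,
)

SparseEncoderTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss).train()
```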
2. Results
2-1. Loss
Note: klue/roberta-base was trained on 4 GPUs while the others used 8 GPUs, so its logs appear at half the frequency.
- Overall, `train/loss` (right panel) starts high and converges cleanly, indicating stable training across models.
- Why does gte-multilingual-base spike on the regularizer early on? SPLADE’s sparsity regularizer (e.g., FLOPS or L1) penalizes the number of active dimensions. gte-multilingual-base has a very large multilingual vocab (~250k); early in training it tends to light up too many terms at once (dense activation), which produces a high regularizer loss spike. As training proceeds, the sparsity term quickly suppresses unnecessary activations and the curve stabilizes.
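To make the spike concrete, here is a minimal sketch of the FLOPS penalty (not the exact sentence-transformers implementation): it squares the per-term mean activation over a batch, so a wide vocabulary that activates densely early in training produces a large value.

```python
import torch

def flops_regularizer(activations: torch.Tensor) -> torch.Tensor:
    """FLOPS-style penalty over a batch of sparse representations.

    activations: (batch_size, vocab_size) non-negative term weights.
    Terms that fire across many examples get the largest penalty, which
    pushes the model toward fewer, more selective active dimensions.
    """
    mean_per_term = activations.mean(dim=0)   # average activation of each vocab term
    return (mean_per_term ** 2).sum()

# Toy illustration: a batch of 8 densely activated vectors over a ~250k-term vocab
# (roughly gte-multilingual-base's size) yields a large penalty -> the early spike.
dense_batch = torch.relu(torch.randn(8, 250_048))
print(flops_regularizer(dense_batch))
```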
2-2. Validation
During training, I validated every 0.05 epoch with AutoRAGRetrieval and tracked Recall@10.
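For reference, this kind of periodic validation can be wired in with sentence-transformers’ SparseInformationRetrievalEvaluator; the queries, corpus, and qrels below are tiny placeholders standing in for AutoRAGRetrieval, and eval_steps=0.05 (a fraction of total training steps) approximates evaluating every 0.05 epoch.

```python
from sentence_transformers.sparse_encoder.evaluation import SparseInformationRetrievalEvaluator

# Placeholder data standing in for the AutoRAGRetrieval queries/corpus/qrels.
queries = {"q1": "김치는 어떻게 만드나요?"}
corpus = {"d1": "김치는 배추를 소금에 절여 만든다.", "d2": "서울은 대한민국의 수도이다."}
relevant_docs = {"q1": {"d1"}}

dev_evaluator = SparseInformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    precision_recall_at_k=[10],   # reports Recall@10, the metric tracked here
    name="autorag-dev",
)

# Pass `evaluator=dev_evaluator` to SparseEncoderTrainer and set
# eval_strategy="steps", eval_steps=0.05 in SparseEncoderTrainingArguments.
```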
- As the graph shows, jhu-clsp/mmBERT-base improves for a while and then drops sharply—this matches the failure mode discussed in the GitHub issue above.
- The other three models steadily improve, and the two with Korean-aware vocabularies (klue/roberta-base, skt/A.X-Encoder-base) perform particularly well.
2-3. MTEB-ko-retrieval evaluation
After training, I evaluated on MTEB-ko-retrieval.
model | avg Recall@10 | avg NDCG@10 | avg MRR@10 | avg Query Active Dims | avg Corpus Active Dims |
---|---|---|---|---|---|
A.X-Encoder-base | 0.731 | 0.6618 | 0.6882 | 84.2279 | 650.6541 |
roberta-base | 0.6751 | 0.6234 | 0.6593 | 28.3942 | 188.0523 |
gte-multilingual-base | 0.61 | 0.5224 | 0.5385 | 1115.8582 | 2728.6814 |
mmBERT-base | 0.023 | 0.0103 | 0.0065 | 0 | 0 |
- As expected, the Korean-centric vocabularies yield higher scores.
- The most striking case is mmBERT-base, which scores near zero. The reason is visible in the active dimension counters: both query and corpus activations collapsed to zero.
Up to step 2844, both query and corpus were active and validation was healthy:
But by step 4266, both query and corpus active dimensions fell to 0, and validation metrics simultaneously crashed:
2-4. Analysis — Why did mmBERT collapse?
This is a representation collapse in SPLADE-style training: the model drives all activations to zero, so queries and documents become all-zero sparse vectors and retrieval is impossible.
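A quick way to spot this in a trained checkpoint is to encode a few texts and count non-zero dimensions. The sketch below assumes SparseEncoder’s `sparsity` utility from sentence-transformers v5, and the checkpoint path is hypothetical.

```python
from sentence_transformers import SparseEncoder

# Hypothetical checkpoint path; any trained SparseEncoder works here.
model = SparseEncoder("output/splade-mmbert-base/checkpoint-4266")

query_emb = model.encode_query(["김치는 어떻게 만드나요?"])
doc_emb = model.encode_document(["김치는 배추를 소금에 절여 만든다."])

# `sparsity` reports the mean number of non-zero vocabulary dimensions per row
# (the "active dims" in the table above); 0.0 means the vectors have collapsed.
print(model.sparsity(query_emb)["active_dims"])
print(model.sparsity(doc_emb)["active_dims"])
```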
Why does this happen?
- Tokenizer–language mismatch: jhu-clsp/mmBERT-base uses the gemma2 tokenizer for ~1800 languages. On Korean data, this can over-fragment tokens or map them to extremely rare subwords. When the model cannot project Korean text meaningfully into the vocab space, it learns that the “safest” way to reduce loss (under sparsity pressure) is to output zeros. In SPLADE, because the vocab directly is the output space, this mismatch can destroy the representation entirely.
- Over-aggressive sparsity regularization: SPLADE pushes representations to be sparse with FLOPS/L1 penalties. If the activations are already weak (due to mismatch), the regularizer makes “all zeros” the easiest optimum, accelerating collapse.
- Normalization and scale issues: SPLADE typically uses `log(1 + ReLU(Wx))`. If encoder outputs are poorly scaled (e.g., LayerNorm not well adapted to Korean distributions), most values can die at ReLU, leaving zeros everywhere.
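The `log(1 + ReLU(Wx))` mapping in the last point can be made concrete. Below is a minimal sketch of the standard SPLADE term weighting (not the sentence-transformers internals): MLM logits are passed through a log-saturated ReLU and max-pooled over tokens, so if the logits for a given input sit mostly below zero, almost every vocabulary dimension dies at the ReLU.

```python
import torch

def splade_activations(mlm_logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Standard SPLADE term weighting: log(1 + ReLU(logits)), max-pooled over tokens.

    mlm_logits: (batch, seq_len, vocab_size) output of the MLM head (the Wx above).
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    Returns (batch, vocab_size) sparse term-weight vectors.
    """
    weights = torch.log1p(torch.relu(mlm_logits))      # log(1 + ReLU(Wx))
    weights = weights * attention_mask.unsqueeze(-1)   # zero out padding positions
    return weights.max(dim=1).values                   # max over the sequence

# If the MLM head is badly scaled for the input language, logits stay negative
# and ReLU wipes out everything -> an all-zero (collapsed) representation.
logits = torch.full((1, 4, 256_000), -3.0)             # mostly-negative logits, mmBERT-sized vocab
mask = torch.ones(1, 4)
rep = splade_activations(logits, mask)
print(int((rep > 0).sum()))                            # 0 active dimensions
```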
In short, the mmBERT vocabulary did not cover Korean well enough, and the sparsity regularizer finished the job by zeroing out what little survived—leading to the observed drop in active dims and Recall@10.
3. Conclusion
In these experiments with four backbones, I analyzed how an MLM model’s vocabulary and pretrained weights influence final performance when training a sparse retriever.
The results are clear: performance is not determined only by backbone size or generic pretraining quality. What matters most is how the vocabulary defines the model’s representation space.
- klue/roberta-base and skt/A.X-Encoder-base (Korean-aware vocabularies) trained stably and achieved strong scores across metrics.
- Alibaba-NLP/gte-multilingual-base trained successfully but exhibited very broad (noisy) activation early on, which the sparsity regularizer later pruned.
- jhu-clsp/mmBERT-base failed catastrophically: its vocabulary did not adequately cover Korean, leading to representation collapse (all-zero activations) and near-zero retrieval quality.
This goes beyond “the tokenizer matters.” In learned sparse retrieval, the vocabulary directly sets the dimensionality and structure of the representation: it defines the interaction space where queries and documents communicate. A well-aligned vocabulary yields a meaningful retrieval space; a poorly aligned one destroys it.
Takeaway:
Before asking “Which backbone should I use?”, I recommend asking “Can this model’s vocabulary properly express the language in my data?”. Finding the model whose vocabulary best fits your data is what leads to real performance gains.
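One quick, low-cost way to answer that question before any training is to tokenize a few sentences from your own data with each candidate vocabulary and compare how badly they fragment. A rough sketch using Hugging Face transformers (depending on the model, extra loading options such as trust_remote_code may be needed):

```python
from transformers import AutoTokenizer

# A sample sentence from your own domain works better than a generic one.
text = "한국어 문서 검색을 위한 희소 임베딩 모델을 학습한다."

for name in ["klue/roberta-base", "skt/A.X-Encoder-base",
             "Alibaba-NLP/gte-multilingual-base", "jhu-clsp/mmBERT-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(text)
    # Heavy fragmentation into rare subwords is a warning sign of vocabulary mismatch.
    print(f"{name} ({len(tok)} vocab): {len(pieces)} tokens -> {pieces}")
```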