Abstract
Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
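For context, the usual biencoder training setup pairs each query with its positive document and treats the other documents in the batch as negatives via an InfoNCE-style loss. The sketch below is a generic illustration of that baseline (not the paper's exact code; names and the temperature value are placeholders). The first method described above keeps this loss but changes which documents share a batch, so the in-batch negatives come from the same context as the positive.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs, doc_embs, temperature=0.02):
    """InfoNCE-style loss with in-batch negatives.

    query_embs, doc_embs: (batch_size, dim) tensors; row i of each is a positive pair.
    Every other row in the batch acts as a negative. The paper's contextual objective
    constructs batches from neighboring documents, which makes these negatives harder.
    """
    query_embs = F.normalize(query_embs, dim=-1)
    doc_embs = F.normalize(doc_embs, dim=-1)
    logits = query_embs @ doc_embs.T / temperature            # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```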
Community
We spent a year developing cde-small-v1, the best BERT-sized text embedding model in the world. Today, we're releasing the model on HuggingFace, along with the paper on ArXiv.
Typical text embedding models have two main problems:
1. training them is complicated and requires many tricks: giant batches, distillation, hard negatives...
2. the embeddings don't "know" what corpus they will be used in; consequently, all text spans are encoded the same way
To fix (1), we develop a new training technique: contextual batching. All batches share a lot of context; one batch might be about horse races in Kentucky, the next about differential equations, and so on.
This lets us get better performance without big batches or hard negative mining. There's also some cool theory behind it.
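As a rough illustration of the idea, and assuming a simple k-means-over-cheap-embeddings procedure (the paper's actual clustering pipeline may differ), contextual batching can be sketched like this: cluster the training examples, then fill each batch from a single cluster so that in-batch negatives share a topic.

```python
import numpy as np
from sklearn.cluster import KMeans

def contextual_batches(corpus_embeddings, batch_size=256, seed=0):
    """Yield index batches drawn from topically coherent clusters.

    corpus_embeddings: (N, d) array from any cheap off-the-shelf embedding model.
    Each yielded batch contains indices from a single cluster, so the in-batch
    negatives share context (e.g. they are all about the same topic).
    """
    rng = np.random.default_rng(seed)
    n_clusters = max(1, len(corpus_embeddings) // batch_size)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(corpus_embeddings)
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            if len(batch) == batch_size:  # drop ragged tail batches
                yield batch
```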
And for (2), we propose a new contextual embedding architecture. This requires changes to both the training and evaluation pipelines to incorporate contextual tokens: essentially, the model sees extra text from the surrounding context and can update the embedding accordingly.
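A heavily simplified sketch of the two-stage idea (module and method names here are placeholders, not the released model's API): a first-stage encoder compresses a sample of corpus documents into context token embeddings, and a second-stage encoder prepends those context tokens to a document's own tokens before pooling the final embedding. The context tokens are computed once per corpus and reused for every document and query embedded against it, which is the extra step mentioned below.

```python
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """Two-stage contextual embedding sketch (illustrative only, not the released model)."""

    def __init__(self, first_stage: nn.Module, second_stage: nn.Module, token_embed: nn.Module):
        super().__init__()
        self.first_stage = first_stage    # encodes sampled corpus docs -> context token embeddings
        self.second_stage = second_stage  # encodes [context tokens; doc tokens] -> final embedding
        self.token_embed = token_embed    # maps input ids to token embeddings

    def embed_context(self, corpus_input_ids: torch.Tensor) -> torch.Tensor:
        # One vector per sampled corpus document: (num_context_docs, hidden)
        return self.first_stage(self.token_embed(corpus_input_ids)).mean(dim=1)

    def forward(self, doc_input_ids: torch.Tensor, context_tokens: torch.Tensor) -> torch.Tensor:
        # Prepend the shared context tokens to each document's own token embeddings.
        doc_tokens = self.token_embed(doc_input_ids)                      # (B, L, hidden)
        ctx = context_tokens.unsqueeze(0).expand(doc_tokens.size(0), -1, -1)
        hidden = self.second_stage(torch.cat([ctx, doc_tokens], dim=1))   # (B, C + L, hidden)
        return hidden.mean(dim=1)                                         # pooled document embedding
```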
If you use text embeddings, feel free to try cde-small-v1 on HuggingFace: https://huggingface.co/jxm/cde-small-v1. As noted, it's slightly more involved to use, since there's an extra step of embedding context tokens beforehand.
Let us know what you think!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LowREm: A Repository of Word Embeddings for 87 Low-Resource Languages Enhanced with Multilingual Graph Knowledge (2024)
- Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models (2024)
- Mistral-SPLADE: LLMs for better Learned Sparse Retrieval (2024)
- ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning (2024)
- Making Text Embedders Few-Shot Learners (2024)
Cool idea!