---
tags:
- ColBERT
- PyLate
- contextual-embeddings
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
---
# ModernColBERT + InSeNT
[![arXiv](https://img.shields.io/badge/arXiv-2505.24782-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/2505.24782)
[![GitHub](https://img.shields.io/badge/Code_Repository-100000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/illuin-tech/contextual-embeddings)
[![Hugging Face](https://img.shields.io/badge/ConTEB_HF_Page-FFD21E?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/illuin-conteb)
<img src="https://cdn-uploads.huggingface.co/production/uploads/60f2e021adf471cbdf8bb660/jq_zYRy23bOZ9qey3VY4v.png" width="800">
This is a contextual model fine-tuned from [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) on the ConTEB training dataset. It was trained with the InSeNT approach detailed in the corresponding paper.
> [!WARNING]
> This experimental model stems from the paper [*Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings*](https://arxiv.org/abs/2505.24782).
> While results are promising, we have observed regressions on standard embedding tasks, and using it in production will likely require further work on extending the training set to improve robustness and out-of-domain (OOD) generalization.
## Usage
### Direct Usage
First install the `contextual-embeddings` package:
```bash
pip install git+https://github.com/illuin-tech/contextual-embeddings
```
To run inference with a contextual model, you can use the following example:
```python
from contextual_embeddings import LongContextEmbeddingModel
from pylate.models import ColBERT

# Each document is a list of chunks; chunks are embedded in the context
# of their full document.
documents = [
    [
        "The old lighthouse keeper trimmed his lamp, its beam cutting a lonely path through the fog.",
        "He remembered nights of violent storms, when the ocean seemed to swallow the sky whole.",
        "Still, he found comfort in his duty, a silent guardian against the treacherous sea.",
    ],
    [
        "A curious fox cub, all rust and wonder, ventured out from its den for the first time.",
        "Each rustle of leaves, every chirping bird, was a new symphony to its tiny ears.",
        "Under the watchful eye of its mother, it began to learn the secrets of the whispering forest.",
    ],
]

base_model = ColBERT("illuin-conteb/modern-colbert-insent")
contextual_model = LongContextEmbeddingModel(
    base_model=base_model,
    pooling_mode="tokens",
)

embeddings = contextual_model.embed_documents(documents)
print("Number of documents:", len(embeddings))  # 2
print("Number of chunks in first document:", len(embeddings[0]))  # 3
print(f"Shape of first chunk embedding: {embeddings[0][0].shape}")  # torch.Size([22, 128])
```
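The per-chunk token embeddings can then be scored against a query with ColBERT-style MaxSim late interaction (the similarity function listed under Model Details). A minimal sketch in plain PyTorch; the tensors here are random stand-ins with the shapes the model produces, not real model outputs:

```python
import torch

# ColBERT-style MaxSim: for each query token, take the maximum dot product
# over all document tokens, then sum over query tokens.
def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, 128), doc_emb: (num_doc_tokens, 128)
    sim = query_emb @ doc_emb.T         # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()  # scalar relevance score

# Stand-in embeddings with the model's 128-dim output.
query_emb = torch.randn(8, 128)
chunk_emb = torch.randn(22, 128)
score = maxsim_score(query_emb, chunk_emb)
print(score.item())
```

In practice you would rank all chunk embeddings returned by `embed_documents` by this score for a given query.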
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 128 dimensions
- **Similarity Function:** MaxSim
- **Training Dataset:** ConTEB training set
### Model Sources
- **Repository:** [Contextual Embeddings](https://github.com/illuin-tech/contextual-embeddings)
- **Hugging Face:** [Contextual Embeddings](https://huggingface.co/illuin-conteb)
### Full Model Architecture
```
ColBERT(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
(1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```
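As the architecture above shows, token states from the 768-dim ModernBERT backbone are projected to 128 dimensions by a bias-free linear layer. A minimal PyTorch sketch of that projection head (randomly initialized here, not the trained weights):

```python
import torch

# Bias-free linear projection matching the Dense module above:
# 768-dim ModernBERT token states -> 128-dim ColBERT token embeddings.
projection = torch.nn.Linear(in_features=768, out_features=128, bias=False)

hidden_states = torch.randn(22, 768)  # token states for one 22-token chunk
token_embeddings = projection(hidden_states)
print(token_embeddings.shape)         # torch.Size([22, 128])
```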
## Citation
```bibtex
@misc{conti2025contextgoldgoldpassage,
title={Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings},
author={Max Conti and Manuel Faysse and Gautier Viaud and Antoine Bosselut and Céline Hudelot and Pierre Colombo},
year={2025},
eprint={2505.24782},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2505.24782},
}
```