	---
	language:
	- en
	license: apache-2.0
	tags:
	- sentence-transformers
	- sparse-encoder
	- sparse
	- asymmetric
	- inference-free
	- splade
	- generated_from_trainer
	- dataset_size:9000
	- loss:SpladeLoss
	- loss:SparseMultipleNegativesRankingLoss
	- loss:FlopsLoss
	- dataset_size:89000
	base_model: distilbert/distilbert-base-uncased
	widget:
	- text: Blank Neoprene Water Bottle Coolies (Variety Color 10 Pack)
	- text: Dream Spa 3-way 8-Setting Rainfall Shower Head and Handheld Shower Combo (Chrome).
	    Use Luxury 7-inch Rain Showerhead or 7-Function Hand Shower for Ultimate Spa Experience!
	- text: ¿Está disponible el nuevo iPhone 7 Plus?
	- text: Naipo Back Massager Massage Chair Vibrating Car Seat Cushion for Back, Neck,
	    and Thigh with 8 Motor Vibrations 4 Modes 3 Speed Heating at Home Office Car
	- text: Pizuna 400 Thread Count Cotton Fitted-Sheet Queen Size White 1pc, 100% Long
	    Staple Cotton Sateen Fitted Bed Sheet With All Around Elastic Deep Pocket Queen
	    Sheets Fit Up to 15Inch (White Fitted Sheet)
	pipeline_tag: feature-extraction
	library_name: sentence-transformers
	metrics:
	- dot_accuracy@1
	- dot_accuracy@3
	- dot_accuracy@5
	- dot_accuracy@10
	- dot_precision@1
	- dot_precision@3
	- dot_precision@5
	- dot_precision@10
	- dot_recall@1
	- dot_recall@3
	- dot_recall@5
	- dot_recall@10
	- dot_ndcg@10
	- dot_mrr@10
	- dot_map@100
	- query_active_dims
	- query_sparsity_ratio
	- corpus_active_dims
	- corpus_sparsity_ratio
	model-index:
	- name: Inference-free SPLADE distilbert-base-uncased trained on Natural-Questions
	    tuples
	  results:
	  - task:
	      type: sparse-information-retrieval
	      name: Sparse Information Retrieval
	    dataset:
	      name: NanoMSMARCO
	      type: NanoMSMARCO
	    metrics:
	    - type: dot_accuracy@1
	      value: 0.3
	      name: Dot Accuracy@1
	    - type: dot_accuracy@3
	      value: 0.58
	      name: Dot Accuracy@3
	    - type: dot_accuracy@5
	      value: 0.66
	      name: Dot Accuracy@5
	    - type: dot_accuracy@10
	      value: 0.76
	      name: Dot Accuracy@10
	    - type: dot_precision@1
	      value: 0.3
	      name: Dot Precision@1
	    - type: dot_precision@3
	      value: 0.19333333333333336
	      name: Dot Precision@3
	    - type: dot_precision@5
	      value: 0.132
	      name: Dot Precision@5
	    - type: dot_precision@10
	      value: 0.07600000000000001
	      name: Dot Precision@10
	    - type: dot_recall@1
	      value: 0.3
	      name: Dot Recall@1
	    - type: dot_recall@3
	      value: 0.58
	      name: Dot Recall@3
	    - type: dot_recall@5
	      value: 0.66
	      name: Dot Recall@5
	    - type: dot_recall@10
	      value: 0.76
	      name: Dot Recall@10
	    - type: dot_ndcg@10
	      value: 0.5302210774188797
	      name: Dot Ndcg@10
	    - type: dot_mrr@10
	      value: 0.45638095238095233
	      name: Dot Mrr@10
	    - type: dot_map@100
	      value: 0.4675385567218492
	      name: Dot Map@100
	    - type: query_active_dims
	      value: 6.380000114440918
	      name: Query Active Dims
	    - type: query_sparsity_ratio
	      value: 0.9997909704437966
	      name: Query Sparsity Ratio
	    - type: corpus_active_dims
	      value: 813.6908569335938
	      name: Corpus Active Dims
	    - type: corpus_sparsity_ratio
	      value: 0.9733408408055306
	      name: Corpus Sparsity Ratio
	---

	# Inference-free SPLADE distilbert-base-uncased trained on Natural-Questions tuples

	This is an [Asymmetric Inference-free SPLADE Sparse Encoder](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) model finetuned from [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) using the [sentence-transformers](https://www.SBERT.net) library. It maps sentences and paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

	## Model Details

	### Model Description
	- Model Type: Asymmetric Inference-free SPLADE Sparse Encoder
	- Base model: [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) <!-- at revision 12040accade4e8a0f71eabdb258fecc2e7e948be -->
	- Maximum Sequence Length: 512 tokens
	- Output Dimensionality: 30522 dimensions
	- Similarity Function: Dot Product
	<!-- - Training Dataset: Unknown -->
	- Language: en
	- License: apache-2.0

	### Model Sources

	- Documentation: [Sentence Transformers Documentation](https://sbert.net)
	- Documentation: [Sparse Encoder Documentation](https://www.sbert.net/docs/sparse_encoder/usage/usage.html)
	- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
	- Hugging Face: [Sparse Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=sparse-encoder)

	### Full Model Architecture

	```
	SparseEncoder(
	(0): Router(
	(sub_modules): ModuleDict(
	(query): Sequential(
	(0): SparseStaticEmbedding({'frozen': False}, dim=30522, tokenizer=DistilBertTokenizerFast)
	)
	(document): Sequential(
	(0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'DistilBertForMaskedLM'})
	(1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
	)
	)
	)
	)
	```
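
The routing above can be read as two encoders that share one vocabulary-sized output space. The following is a minimal pure-Python sketch, not the library implementation: the toy vocabulary, the weights, and the function names `encode_query_static` and `splade_max_pool` are illustrative assumptions. It assumes the usual SPLADE formulation, in which the document vector is the maximum over token positions of log(1 + ReLU(logit)), and that the static query embedding is a lookup of one learned weight per query token.

```python
import math

# Toy vocabulary standing in for the 30522-entry DistilBERT vocabulary (assumption).
VOCAB = ["cotton", "sheet", "queen", "shower", "curtain"]

def encode_query_static(token_ids, token_weights):
    """Query path: look up one learned weight per token that occurs in the query."""
    vec = [0.0] * len(token_weights)
    for tid in token_ids:
        vec[tid] = token_weights[tid]
    return vec

def splade_max_pool(token_logits):
    """Document path: log(1 + relu(logit)), then max over token positions."""
    vocab_size = len(token_logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(vocab_size)
    ]

# Hypothetical learned query weights, one per vocabulary entry.
query_weights = [1.2, 0.9, 1.5, 1.1, 0.8]
query_vec = encode_query_static([0, 2], query_weights)  # tokens "cotton", "queen"

# Hypothetical MLM logits for a two-token document (rows = token positions).
doc_logits = [
    [2.0, 1.0, -0.5, -3.0, -1.0],
    [0.5, 3.0, 1.0, -2.0, -4.0],
]
doc_vec = splade_max_pool(doc_logits)

# Relevance is the dot product of the two sparse vectors.
score = sum(q * d for q, d in zip(query_vec, doc_vec))
print(query_vec)
print([round(x, 3) for x in doc_vec])
print(round(score, 3))
```

The query path needs no transformer forward pass, which is what "inference-free" refers to: only documents go through `DistilBertForMaskedLM`, and that can happen offline at indexing time.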

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SparseEncoder

	# Download from the 🤗 Hub
	model = SparseEncoder("monkeypostulate/inference-free-splade-distilbert-base-uncased-nq")
	# Run inference
	queries = [
	    "¿Hay una sábana de algodón ajustada disponible en tamaño queen?",
	]
	documents = [
	    'Pizuna 400 Thread Count Cotton Fitted-Sheet Queen Size White 1pc, 100% Long Staple Cotton Sateen Fitted Bed Sheet With All Around Elastic Deep Pocket Queen Sheets Fit Up to 15Inch (White Fitted Sheet)',
	    'ArtSocket Shower Curtain Teal Rustic Shabby Country Chic Blue Curtains Wood Rose Home Bathroom Decor Polyester Fabric Waterproof 72 x 72 Inches Set with Hooks',
	    'AFARER Case Compatible with Samsung Galaxy S7 5.1 inch, Military Grade 12ft Drop Tested Protective Case with Kickstand,Military Armor Dual Layer Protective Cover - Blue',
	]
	query_embeddings = model.encode_query(queries)
	document_embeddings = model.encode_document(documents)
	print(query_embeddings.shape, document_embeddings.shape)
	# [1, 30522] [3, 30522]

	# Get the similarity scores for the embeddings
	similarities = model.similarity(query_embeddings, document_embeddings)
	print(similarities)
	# tensor([[13.2777, 7.2952, 2.9255]])
	```
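
Because the similarity function is a dot product over mostly-zero vectors, only dimensions that are active in both the query and the document contribute to a score. The sketch below illustrates that with sparse vectors stored as dictionaries. The token weights and document names are hypothetical, not real model output, and `sparse_dot` is an illustrative helper rather than a library function.

```python
def sparse_dot(query, document):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost depends on the sparser side.
    if len(query) > len(document):
        query, document = document, query
    return sum(w * document[t] for t, w in query.items() if t in document)

# Hypothetical weights for illustration only.
query = {"cotton": 1.4, "sheet": 1.1, "queen": 1.6}
documents = {
    "fitted_sheet": {"cotton": 1.2, "sheet": 1.5, "queen": 0.9, "white": 0.7},
    "shower_curtain": {"shower": 1.8, "curtain": 1.6, "polyester": 0.5},
    "phone_case": {"case": 1.7, "samsung": 1.3},
}

scores = {name: sparse_dot(query, vec) for name, vec in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], round(scores[ranking[0]], 2))
```

Documents that share no active token with the query score exactly zero, which is why sparse embeddings pair naturally with inverted-index retrieval.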

	<!--
	### Direct Usage (Transformers)

	<details><summary>Click to see the direct usage in Transformers</summary>

	</details>
	-->

	<!--
	### Downstream Usage (Sentence Transformers)

	You can finetune this model on your own dataset.

	<details><summary>Click to expand</summary>

	</details>
	-->

	<!--
	### Out-of-Scope Use

	List how the model may foreseeably be misused and address what users ought not to do with the model.
	-->

	## Evaluation

	### Metrics

	#### Sparse Information Retrieval

	* Dataset: `NanoMSMARCO`
	* Evaluated with [<code>SparseInformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sparse_encoder/evaluation.html#sentence_transformers.sparse_encoder.evaluation.SparseInformationRetrievalEvaluator)

	| Metric                | Value    |
	|:----------------------|:---------|
	| dot_accuracy@1        | 0.3      |
	| dot_accuracy@3        | 0.58     |
	| dot_accuracy@5        | 0.66     |
	| dot_accuracy@10       | 0.76     |
	| dot_precision@1       | 0.3      |
	| dot_precision@3       | 0.1933   |
	| dot_precision@5       | 0.132    |
	| dot_precision@10      | 0.076    |
	| dot_recall@1          | 0.3      |
	| dot_recall@3          | 0.58     |
	| dot_recall@5          | 0.66     |
	| dot_recall@10         | 0.76     |
	| dot_ndcg@10           | 0.5302   |
	| dot_mrr@10            | 0.4564   |
	| dot_map@100           | 0.4675   |
	| query_active_dims     | 6.38     |
	| query_sparsity_ratio  | 0.9998   |
	| corpus_active_dims    | 813.6909 |
	| corpus_sparsity_ratio | 0.9733   |
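
Several rows of this table are related by simple arithmetic, which is useful when sanity-checking a re-run. The sketch below assumes that the sparsity ratio is one minus the mean number of active dimensions divided by the 30522-dimensional output size, and that each query has exactly one relevant document (an assumption consistent with recall@k equalling accuracy@k above). The helper name `sparsity_ratio` is illustrative.

```python
VOCAB_SIZE = 30522  # output dimensionality stated in the model description

def sparsity_ratio(active_dims, dim=VOCAB_SIZE):
    """Fraction of dimensions that are zero, given the mean number of non-zero ones."""
    return 1.0 - active_dims / dim

query_sparsity = sparsity_ratio(6.38)
corpus_sparsity = sparsity_ratio(813.6909)
print(round(query_sparsity, 4), round(corpus_sparsity, 4))

# With one relevant document per query, precision@k = accuracy@k / k.
accuracy_at_k = {1: 0.3, 3: 0.58, 5: 0.66, 10: 0.76}
precision_at_k = {k: acc / k for k, acc in accuracy_at_k.items()}
print({k: round(p, 4) for k, p in precision_at_k.items()})
```

Under those assumptions the computed values reproduce the reported sparsity ratios (0.9998 and 0.9733) and the precision rows (0.3, 0.1933, 0.132, 0.076).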

	<!--
	## Bias, Risks and Limitations

	What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.
	-->

	<!--
	### Recommendations

	What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.
	-->

	## Training Details

	### Training Dataset

	#### Unnamed Dataset

	* Size: 89,000 training samples
	* Columns: <code>query</code> and <code>document</code>
	* Approximate statistics based on the first 1000 samples:
	  |         | query | document |
	  |:--------|:------|:---------|
	  | type    | string | string |
	  | details | <ul><li>min: 8 tokens</li><li>mean: 21.52 tokens</li><li>max: 44 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 33.4 tokens</li><li>max: 93 tokens</li></ul> |
	* Samples:
	  | query | document |
	  |:------|:---------|
	  | <code>¿Hay una lámpara de colgar con batería disponible?</code> | <code>Farmhouse Plug in Pendant Light with On/Off Switch Wire Caged Hanging Pendant Lamp 16ft Cord</code> |
	  | <code>¿Hay leggings con bolsillos disponibles para mujeres?</code> | <code>IUGA High Waist Yoga Pants with Pockets, Tummy Control, Workout Pants for Women 4 Way Stretch Yoga Leggings with Pockets</code> |
	  | <code>¿Cuál es la tapa de oscuridad marrón disponible?</code> | <code>Thicken It 100% Scalp Coverage Hair Powder - DARK BROWN - Talc-Free .32 oz. Water Resistant Hair Loss Concealer. Naturally Thicker Than Hair Fibers & Spray Concealers</code> |
	* Loss: [<code>SpladeLoss</code>](https://sbert.net/docs/package_reference/sparse_encoder/losses.html#spladeloss) with these parameters:
	```json
	{
	"loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
	"document_regularizer_weight": 0.003,
	"query_regularizer_weight": 0
	}
	```
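
To make these parameters concrete, here is a minimal pure-Python sketch of how such a loss is typically composed: an in-batch-negatives ranking loss plus FLOPS regularizers weighted by `document_regularizer_weight` and `query_regularizer_weight`. It assumes the standard formulations from the papers cited in the Citation section; the function names and the toy batch are hypothetical, and the library's implementation may differ in detail.

```python
import math

def in_batch_ranking_loss(queries, documents, scale=1.0):
    """Cross-entropy over dot-product scores; document i is the positive for query i."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * sum(a * b for a, b in zip(q, d)) for d in documents]
        top = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(queries)

def flops_regularizer(embeddings):
    """FLOPS penalty: sum over dimensions of the squared mean activation."""
    n = len(embeddings)
    dims = len(embeddings[0])
    return sum((sum(e[j] for e in embeddings) / n) ** 2 for j in range(dims))

def splade_loss(queries, documents, document_weight=0.003, query_weight=0.0):
    """Ranking loss plus weighted sparsity regularizers."""
    return (
        in_batch_ranking_loss(queries, documents)
        + document_weight * flops_regularizer(documents)
        + query_weight * flops_regularizer(queries)
    )

# Hypothetical 2-pair batch over a 3-dimensional toy vocabulary.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
documents = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
print(round(splade_loss(queries, documents), 4))
```

With `query_regularizer_weight` set to 0, only the document side is pushed toward sparsity in this sketch; the query side is already sparse by construction because it activates only tokens present in the query.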

	### Evaluation Dataset

	#### Unnamed Dataset

	* Size: 1,000 evaluation samples
	* Columns: <code>query</code> and <code>document</code>
	* Approximate statistics based on the first 1000 samples:
	  |         | query | document |
	  |:--------|:------|:---------|
	  | type    | string | string |
	  | details | <ul><li>min: 8 tokens</li><li>mean: 20.94 tokens</li><li>max: 40 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 33.09 tokens</li><li>max: 79 tokens</li></ul> |
	* Samples:
	  | query | document |
	  |:------|:---------|
	  | <code>¿Qué es un modelo anatómico del corazón?</code> | <code>Axis Scientific Heart Model, 2-Part Deluxe Life Size Human Heart Replica with 34 Anatomical Structures, Held Together with Magnets, Includes Mounted Display Base, Detailed Product Manual and Warranty</code> |
	  | <code>¿Hay un buscador de peces portátil disponible?</code> | <code>HawkEye Fishtrax 1C Fish Finder with HD Color Virtuview Display, Black/Red, 2" H x 1.6" W Screen Size</code> |
	  | <code>¿Hay un disfraz de Anna adulta de Frozen disponible para comprar?</code> | <code>Mitef Anime Cosplay Costume Princess Anna Fancy Dress with Shawl for Adult, L</code> |
	* Loss: [<code>SpladeLoss</code>](https://sbert.net/docs/package_reference/sparse_encoder/losses.html#spladeloss) with these parameters:
	```json
	{
	"loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
	"document_regularizer_weight": 0.003,
	"query_regularizer_weight": 0
	}
	```

	### Training Hyperparameters
	#### Non-Default Hyperparameters

	- `eval_strategy`: steps
	- `per_device_train_batch_size`: 256
	- `per_device_eval_batch_size`: 256
	- `learning_rate`: 2e-05
	- `warmup_ratio`: 0.1
	- `batch_sampler`: no_duplicates
	- `router_mapping`: {'query': 'query', 'answer': 'document'}
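
The learning-rate settings above interact with the `linear` scheduler listed under All Hyperparameters. As a rough illustration, the sketch below computes the learning rate at a few steps, assuming a single device, 3 epochs over the 89,000 training samples, and a warmup length of `ceil(total_steps * warmup_ratio)`. The function `linear_schedule_lr` is a hypothetical helper, not a library call.

```python
import math

def linear_schedule_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Assumed single-device run: ceil(89000 / 256) batches per epoch for 3 epochs.
steps_per_epoch = math.ceil(89_000 / 256)
total_steps = 3 * steps_per_epoch
for step in (0, 50, 105, 500, total_steps):
    print(step, f"{linear_schedule_lr(step, total_steps):.2e}")
```

Under these assumptions the peak rate of 2e-05 is reached at the end of warmup and decays to zero by the last step.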

	#### All Hyperparameters
	<details><summary>Click to expand</summary>

	- `overwrite_output_dir`: False
	- `do_predict`: False
	- `eval_strategy`: steps
	- `prediction_loss_only`: True
	- `per_device_train_batch_size`: 256
	- `per_device_eval_batch_size`: 256
	- `per_gpu_train_batch_size`: None
	- `per_gpu_eval_batch_size`: None
	- `gradient_accumulation_steps`: 1
	- `eval_accumulation_steps`: None
	- `torch_empty_cache_steps`: None
	- `learning_rate`: 2e-05
	- `weight_decay`: 0.0
	- `adam_beta1`: 0.9
	- `adam_beta2`: 0.999
	- `adam_epsilon`: 1e-08
	- `max_grad_norm`: 1.0
	- `num_train_epochs`: 3
	- `max_steps`: -1
	- `lr_scheduler_type`: linear
	- `lr_scheduler_kwargs`: {}
	- `warmup_ratio`: 0.1
	- `warmup_steps`: 0
	- `log_level`: passive
	- `log_level_replica`: warning
	- `log_on_each_node`: True
	- `logging_nan_inf_filter`: True
	- `save_safetensors`: True
	- `save_on_each_node`: False
	- `save_only_model`: False
	- `restore_callback_states_from_checkpoint`: False
	- `no_cuda`: False
	- `use_cpu`: False
	- `use_mps_device`: False
	- `seed`: 42
	- `data_seed`: None
	- `jit_mode_eval`: False
	- `use_ipex`: False
	- `bf16`: False
	- `fp16`: False
	- `fp16_opt_level`: O1
	- `half_precision_backend`: auto
	- `bf16_full_eval`: False
	- `fp16_full_eval`: False
	- `tf32`: None
	- `local_rank`: 0
	- `ddp_backend`: None
	- `tpu_num_cores`: None
	- `tpu_metrics_debug`: False
	- `debug`: []
	- `dataloader_drop_last`: False
	- `dataloader_num_workers`: 0
	- `dataloader_prefetch_factor`: None
	- `past_index`: -1
	- `disable_tqdm`: False
	- `remove_unused_columns`: True
	- `label_names`: None
	- `load_best_model_at_end`: False
	- `ignore_data_skip`: False
	- `fsdp`: []
	- `fsdp_min_num_params`: 0
	- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
	- `fsdp_transformer_layer_cls_to_wrap`: None
	- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
	- `deepspeed`: None
	- `label_smoothing_factor`: 0.0
	- `optim`: adamw_torch_fused
	- `optim_args`: None
	- `adafactor`: False
	- `group_by_length`: False
	- `length_column_name`: length
	- `ddp_find_unused_parameters`: None
	- `ddp_bucket_cap_mb`: None
	- `ddp_broadcast_buffers`: False
	- `dataloader_pin_memory`: True
	- `dataloader_persistent_workers`: False
	- `skip_memory_metrics`: True
	- `use_legacy_prediction_loop`: False
	- `push_to_hub`: False
	- `resume_from_checkpoint`: None
	- `hub_model_id`: None
	- `hub_strategy`: every_save
	- `hub_private_repo`: None
	- `hub_always_push`: False
	- `hub_revision`: None
	- `gradient_checkpointing`: False
	- `gradient_checkpointing_kwargs`: None
	- `include_inputs_for_metrics`: False
	- `include_for_metrics`: []
	- `eval_do_concat_batches`: True
	- `fp16_backend`: auto
	- `push_to_hub_model_id`: None
	- `push_to_hub_organization`: None
	- `mp_parameters`:
	- `auto_find_batch_size`: False
	- `full_determinism`: False
	- `torchdynamo`: None
	- `ray_scope`: last
	- `ddp_timeout`: 1800
	- `torch_compile`: False
	- `torch_compile_backend`: None
	- `torch_compile_mode`: None
	- `include_tokens_per_second`: False
	- `include_num_input_tokens_seen`: False
	- `neftune_noise_alpha`: None
	- `optim_target_modules`: None
	- `batch_eval_metrics`: False
	- `eval_on_start`: False
	- `use_liger_kernel`: False
	- `liger_kernel_config`: None
	- `eval_use_gather_object`: False
	- `average_tokens_across_devices`: False
	- `prompts`: None
	- `batch_sampler`: no_duplicates
	- `multi_dataset_batch_sampler`: proportional
	- `router_mapping`: {'query': 'query', 'answer': 'document'}
	- `learning_rate_mapping`: {}

	</details>

	### Training Logs
	| Epoch  | Step | Training Loss | NanoMSMARCO_dot_ndcg@10 |
	|:------:|:----:|:-------------:|:-----------------------:|
	| 0.5747 | 200  | 3.33          | -                       |
	| 1.1494 | 400  | 2.755         | -                       |
	| -1     | -1   | -             | 0.5302                  |
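
The Epoch column can be related to the Step column with a short calculation. This sketch assumes a single training device, so that one optimizer step consumes one batch of 256 samples and an epoch takes `ceil(89000 / 256)` steps; `epoch_at` is an illustrative helper.

```python
import math

TRAIN_SAMPLES = 89_000   # training set size from the card
BATCH_SIZE = 256         # per_device_train_batch_size; one device is assumed

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)

def epoch_at(step):
    """Fractional epoch reached after a given optimizer step."""
    return step / steps_per_epoch

for step in (200, 400):
    print(step, round(epoch_at(step), 4))
```

With 348 steps per epoch, steps 200 and 400 correspond to epochs 0.5747 and 1.1494, matching the logged values.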


	### Framework Versions
	- Python: 3.9.6
	- Sentence Transformers: 5.1.0
	- Transformers: 4.55.0
	- PyTorch: 2.8.0
	- Accelerate: 1.10.0
	- Datasets: 4.0.0
	- Tokenizers: 0.21.4

	## Citation

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	author = "Reimers, Nils and Gurevych, Iryna",
	booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	month = "11",
	year = "2019",
	publisher = "Association for Computational Linguistics",
	url = "https://arxiv.org/abs/1908.10084",
	}
	```

	#### SpladeLoss
	```bibtex
	@misc{formal2022distillationhardnegativesampling,
	title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
	author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
	year={2022},
	eprint={2205.04733},
	archivePrefix={arXiv},
	primaryClass={cs.IR},
	url={https://arxiv.org/abs/2205.04733},
	}
	```

	#### SparseMultipleNegativesRankingLoss
	```bibtex
	@misc{henderson2017efficient,
	title={Efficient Natural Language Response Suggestion for Smart Reply},
	author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
	year={2017},
	eprint={1705.00652},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```

	#### FlopsLoss
	```bibtex
	@article{paria2020minimizing,
	title={Minimizing flops to learn efficient sparse representations},
	author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
	journal={arXiv preprint arXiv:2004.05665},
	year={2020}
	}
	```

	<!--
	## Glossary

	Clearly define terms in order to be accessible across audiences.
	-->

	<!--
	## Model Card Authors

	Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.
	-->

	<!--
	## Model Card Contact

	Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.
	-->