DistilBERT base trained on MIRIAD question-answer tuples

This is a SPLADE Sparse Encoder model finetuned from distilbert/distilbert-base-uncased on the miriad-4.4_m-split dataset using the sentence-transformers library. It maps sentences & paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

Model Details

Model Description

Model Type: SPLADE Sparse Encoder
Base model: distilbert/distilbert-base-uncased
Maximum Sequence Length: 512 tokens
Output Dimensionality: 30522 dimensions
Similarity Function: Dot Product
Training Dataset:
- miriad-4.4_m-split
Language: en
License: apache-2.0

Model Sources

Documentation: Sentence Transformers Documentation
Documentation: Sparse Encoder Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sparse Encoders on Hugging Face

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False}) with MLMTransformer model: DistilBertForMaskedLM 
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
)
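
The pooling step in the architecture above can be illustrated with a minimal pure-Python sketch. It assumes the usual SPLADE formulation, in which each masked-language-model logit is passed through log(1 + ReLU(x)) and the maximum is then taken over the token positions. The helper name `splade_max_pool` and the toy logits are hypothetical and are not taken from the library.

```python
import math

def splade_max_pool(token_logits, attention_mask):
    """Collapse per-token vocabulary logits into one sparse vector.

    token_logits: list of rows, one per token, each of length vocab_size.
    attention_mask: list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, logit in enumerate(row):
            # ReLU zeroes negative logits; log1p dampens large ones.
            weight = math.log1p(max(logit, 0.0))
            # 'max' pooling keeps the strongest activation per vocabulary entry.
            if weight > pooled[j]:
                pooled[j] = weight
    return pooled

# Toy input: 3 tokens (the last is padding) over a 4-entry vocabulary.
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.1, -3.0, 0.0, 1.5],
    [9.0, 9.0, 9.0, 9.0],  # padding row, masked out below
]
vector = splade_max_pool(logits, attention_mask=[1, 1, 0])
active_dims = sum(1 for w in vector if w > 0)
```

In the real model the vocabulary has 30522 entries, and most of them end up at zero, which is what makes the output sparse.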

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("tomaarsen/splade-distilbert-base-uncased-miriad")
# Run inference
queries = [
    "What are the common symptoms experienced by individuals with sports-related concussions and how do they impact their overall health?\n",
]
documents = [
    'Individuals with sports-related concussions may experience a range of symptoms that can affect their physical, cognitive, behavioral, and emotional health. These symptoms can include dizziness, headache, poor sleep, and emotional problems. While 90% of people with a sports concussion recover within 7 to 10 days, at least 10% may experience prolonged symptoms. It is important to evaluate these symptoms as they can provide valuable information for estimating prognosis and predicting the time course and extent of expected recovery.',
    "The physical parameters used to evaluate the tablets included color and appearance, weight variation, hardness, friability, thickness, and disintegration time. These parameters are important indicators of the tablet's quality, stability, and suitability for human use.",
    "The risk factors for developing depression in Alzheimer's Disease (AD) include a family history of depressive symptoms, a personal history of depression, gender, and a young onset of AD. Sleep disturbances, which are common in AD, are also a key predictor of depressive symptoms.",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 30522] [3, 30522]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[43.7983,  0.0000,  9.6222]])
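
To illustrate how dot-product similarity behaves on sparse vectors, here is a small self-contained sketch that stores each vector as a dictionary of active dimensions. The function `sparse_dot` and all indices and weights are hypothetical. It shows why a document that shares no active dimension with the query scores exactly zero, which is presumably what happens for the second document above.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)

# Hypothetical vocabulary indices and weights, for illustration only.
query = {101: 1.2, 2054: 0.8, 7000: 2.0}
doc_related = {2054: 1.5, 7000: 1.0, 9999: 0.3}
doc_unrelated = {42: 2.5, 4321: 1.1}

score_related = sparse_dot(query, doc_related)      # 0.8*1.5 + 2.0*1.0
score_unrelated = sparse_dot(query, doc_unrelated)  # no shared dimensions
```

Only dimensions that are active in both vectors contribute, so the cost of scoring depends on the number of active dimensions rather than on the full vocabulary size.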

Evaluation

Metrics

Sparse Information Retrieval

Datasets: miriad_eval and miriad_test
Evaluated with SparseInformationRetrievalEvaluator

Metric	miriad_eval	miriad_test
dot_accuracy@1	0.9747	0.9765
dot_accuracy@3	0.9919	0.9931
dot_accuracy@5	0.9945	0.996
dot_accuracy@10	0.9964	0.998
dot_precision@1	0.9747	0.9765
dot_precision@3	0.3306	0.331
dot_precision@5	0.1989	0.1992
dot_precision@10	0.0996	0.0998
dot_recall@1	0.9747	0.9765
dot_recall@3	0.9919	0.9931
dot_recall@5	0.9945	0.996
dot_recall@10	0.9964	0.998
dot_ndcg@10	0.9867	0.9883
dot_mrr@10	0.9835	0.9851
dot_map@100	0.9837	0.9852
query_active_dims	28.7031	28.6886
query_sparsity_ratio	0.9991	0.9991
corpus_active_dims	64.087	64.3216
corpus_sparsity_ratio	0.9979	0.9979
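
Some rows of the table can be cross-checked with simple arithmetic. The sketch below assumes that the sparsity ratio is one minus the fraction of active dimensions, and that each query has exactly one relevant document, so precision@k equals recall@k divided by k. Both assumptions are inferred from the reported numbers and are not stated in the card.

```python
VOCAB_SIZE = 30522  # output dimensionality stated in this card

def sparsity_ratio(active_dims, dims=VOCAB_SIZE):
    """Fraction of dimensions that are zero on average."""
    return 1.0 - active_dims / dims

def precision_at_k(recall_at_k, k, relevant_per_query=1):
    """With R relevant documents per query, hits@k = recall@k * R."""
    return recall_at_k * relevant_per_query / k

# miriad_eval column, rounded to four decimals like the table.
query_sparsity = round(sparsity_ratio(28.7031), 4)
corpus_sparsity = round(sparsity_ratio(64.087), 4)
p_at_3 = round(precision_at_k(0.9919, 3), 4)
p_at_10 = round(precision_at_k(0.9964, 10), 4)
```

Under these assumptions the computed values agree with the miriad_eval column for query_sparsity_ratio, corpus_sparsity_ratio, dot_precision@3 and dot_precision@10.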

Training Details

Training Dataset

miriad-4.4_m-split

Dataset: miriad-4.4_m-split at 596b9ab
Size: 100,000 training samples
Columns: question and answer
Approximate statistics based on the first 1000 samples:
	question	answer
type	string	string
details	min: 9 tokens mean: 23.38 tokens max: 71 tokens	min: 24 tokens mean: 103.31 tokens max: 315 tokens

Samples:

question	answer
`What factors may contribute to increased pulmonary conduit durability in patients who undergo the Ross operation compared to those with right ventricular outflow tract obstruction?`	`Several factors may contribute to increased pulmonary conduit durability in patients who undergo the Ross operation compared to those with right ventricular outflow tract obstruction. These factors include later age at operation (allowing for larger homografts), more normal pulmonary artery architecture, absence of severe right ventricular hypertrophy, and more natural positioning of the homograft. However, further systematic studies are needed to confirm these associations.`
`How does MCAM expression in hMSC affect the growth and maintenance of hematopoietic progenitors?`	`MCAM expression in hMSC has been shown to support the growth of hematopoietic progenitors. It enhances the adhesion and migration of HSPC, potentially through direct cell-cell interactions. However, the putative interaction partner of MCAM on HSPC remains unknown. Additionally, MCAM expression in hMSC does not seem to regulate the expression or secretion of SDF-1, a key factor in HSPC homing and maintenance.`
`What is the relationship between Fanconi anemia and breast and ovarian cancer susceptibility genes?`	Fanconi anemia is a rare, autosomal recessive syndrome characterized by chromosomal instability, cancer susceptibility, and hypersensitivity to DNA cross-linking agents. It has been found that all known Fanconi anemia proteins cooperate with breast and/or ovarian cancer susceptibility gene products (BRCA1 and BRCA2) in a pathway required for cellular resistance to DNA cross-linking agents. This pathway, known as the "Fanconi anemia-BRCA pathway," is a DNA damage-activated signaling pathway that controls DNA repair. Methylation of one of the Fanconi anemia genes, FANCF, can lead to the inactivation of this pathway in breast and ovarian cancer, suggesting its importance in human carcinogenesis.

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
    "lambda_corpus": 3e-05,
    "lambda_query": 5e-05
}
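
As a rough illustration of how these parameters interact, the sketch below combines an in-batch ranking loss with FLOPS regularizers on the query and document vectors. It assumes the total is ranking loss + lambda_query * FLOPS(queries) + lambda_corpus * FLOPS(documents), following the SPLADE and FLOPS papers cited at the end of this card. All function names and the toy batch are hypothetical, and the library implementation may differ in details such as weight scheduling.

```python
import math

def flops(batch):
    """FLOPS regularizer: sum over dimensions of the squared mean activation."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(dims))

def in_batch_ranking_loss(queries, docs, scale=1.0):
    """Cross-entropy where docs[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * sum(a * b for a, b in zip(q, d)) for d in docs]
        m = max(scores)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(queries)

def splade_loss(queries, docs, lambda_query=5e-05, lambda_corpus=3e-05):
    """Ranking loss plus weighted sparsity regularizers (assumed combination)."""
    return (
        in_batch_ranking_loss(queries, docs)
        + lambda_query * flops(queries)
        + lambda_corpus * flops(docs)
    )

# Toy batch: two query/document pairs over a 3-entry vocabulary.
queries = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
docs = [[2.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
ranking = in_batch_ranking_loss(queries, docs)
loss = splade_loss(queries, docs)
```

The regularizer weights are small, so the ranking term dominates while the FLOPS terms gently push activations toward zero.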

Evaluation Dataset

miriad-4.4_m-split

Dataset: miriad-4.4_m-split at 596b9ab
Size: 1,000 evaluation samples
Columns: question and answer
Approximate statistics based on the first 1000 samples:
	question	answer
type	string	string
details	min: 8 tokens mean: 23.55 tokens max: 74 tokens	min: 26 tokens mean: 103.03 tokens max: 262 tokens

Samples:

question	answer
`What are some hereditary cancer syndromes that can result in various forms of cancer?`	`Hereditary cancer syndromes, such as Hereditary Breast and Ovarian Cancer (HBOC) and Lynch Syndrome (LS), can result in various forms of cancer due to germline mutations in cancer predisposition genes. These syndromes are associated with an increased risk of developing specific types of cancer.`
`How do MAK-4 and MAK-5 exert their antioxidant properties?`	MAK-4 and MAK-5 have been shown to have antioxidant properties both in vitro and in vivo. These preparations contain multiple antioxidants such as alpha-tocopherol, beta-carotene, ascorbate, bioflavonoid, catechin, polyphenols, riboflavin, and tannic acid. These antioxidants are known to scavenge free radicals and reactive oxygen species (ROS) such as superoxide, hydroxyl, and peroxyl radicals, as well as hydrogen peroxide. In the present study, the antioxidant properties of MAK-4 and MAK-5 were confirmed in mice, with higher oxygen radical absorbance capacity (ORAC) values observed in mice fed the MAK-supplemented diet. Additionally, the activity of liver enzymes GPX, GST, and QR, which are involved in detoxification processes, were upregulated in the MAK-fed mice. This suggests that MAK-4 and MAK-5 may protect against carcinogenesis by reducing oxidative stress and enhancing detoxification processes.
`What are the primary indications for a decompressive craniectomy, and what role does neurocritical care play in determining the suitability of a patient for this procedure?`	The primary indications for a decompressive craniectomy include refractory intracranial pressure (ICP) and progressive neurological deterioration due to mass effect from conditions like head trauma, or ischemic or hemorrhagic cerebrovascular disease. Neurocritical care and ICP monitoring are essential in identifying suitable candidates for the procedure, as it is considered a rescue surgical technique. These measures help to assess the patient's condition and determine the need for decompressive craniectomy in cases of elevated ICP.

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
    "lambda_corpus": 3e-05,
    "lambda_query": 5e-05
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
learning_rate: 2e-05
num_train_epochs: 1
warmup_ratio: 0.1
fp16: True
batch_sampler: no_duplicates
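
The effect of these settings on the learning rate can be sketched as follows. The snippet assumes a single GPU, the 100,000 training samples reported above, and a linear schedule with warmup (the full list below reports lr_scheduler_type: linear). The function `linear_lr` is a hypothetical re-implementation for illustration, not the trainer's own code.

```python
import math

TRAIN_SAMPLES = 100_000   # training set size stated in this card
BATCH_SIZE = 16           # per_device_train_batch_size, single GPU assumed
EPOCHS = 1                # num_train_epochs
BASE_LR = 2e-05           # learning_rate
WARMUP_RATIO = 0.1        # warmup_ratio

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def linear_lr(step):
    """Linear warmup from 0 to BASE_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = total_steps - step
    return BASE_LR * max(0.0, remaining / (total_steps - warmup_steps))

# Fraction of the epoch completed after 200 optimizer steps.
epoch_at_step_200 = 200 / steps_per_epoch
```

With these numbers the run has 6250 optimizer steps, of which the first 625 are warmup, and step 200 corresponds to epoch 0.032, which agrees with the first row of the training logs.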

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

Training Logs

Epoch	Step	Training Loss	Validation Loss	miriad_eval_dot_ndcg@10	miriad_test_dot_ndcg@10
0.032	200	287.5421	-	-	-
0.064	400	0.1454	-	-	-
0.096	600	0.0469	-	-	-
0.128	800	0.0105	-	-	-
0.16	1000	0.0016	0.0016	0.9759	-
0.192	1200	0.0084	-	-	-
0.224	1400	0.0069	-	-	-
0.256	1600	0.0031	-	-	-
0.288	1800	0.0061	-	-	-
0.32	2000	0.0061	0.0006	0.9817	-
0.352	2200	0.0012	-	-	-
0.384	2400	0.0034	-	-	-
0.416	2600	0.0057	-	-	-
0.448	2800	0.0023	-	-	-
0.48	3000	0.0034	0.0005	0.9829	-
0.512	3200	0.0006	-	-	-
0.544	3400	0.002	-	-	-
0.576	3600	0.0025	-	-	-
0.608	3800	0.0008	-	-	-
0.64	4000	0.0019	0.0006	0.9834	-
0.672	4200	0.0106	-	-	-
0.704	4400	0.0084	-	-	-
0.736	4600	0.0035	-	-	-
0.768	4800	0.0016	-	-	-
0.8	5000	0.0037	0.0004	0.9860	-
0.832	5200	0.0044	-	-	-
0.864	5400	0.004	-	-	-
0.896	5600	0.0005	-	-	-
0.928	5800	0.0013	-	-	-
0.96	6000	0.0012	0.0005	0.9868	-
0.992	6200	0.0009	-	-	-
-1	-1	-	-	0.9867	0.9883

Environmental Impact

Carbon emissions were measured using CodeCarbon.

Energy Consumed: 0.119 kWh
Carbon Emitted: 0.046 kg of CO2
Hours Used: 0.375 hours

Training Hardware

On Cloud: No
GPU Model: 1 x NVIDIA GeForce RTX 3090
CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
RAM Size: 31.78 GB

Framework Versions

Python: 3.11.6
Sentence Transformers: 4.2.0.dev0
Transformers: 4.52.4
PyTorch: 2.6.0+cu124
Accelerate: 1.5.1
Datasets: 2.21.0
Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

SpladeLoss

@misc{formal2022distillationhardnegativesampling,
      title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
      author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
      year={2022},
      eprint={2205.04733},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2205.04733},
}

SparseMultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

FlopsLoss

@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}
