BioClinical ModernBERT Embeddings
This is a BioClinical ModernBERT model fine-tuned using sentence-transformers. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. The training dataset was generated using a random sample of PubMed title-abstract pairs along with similar title pairs.
Given that this is a ModernBERT model, it supports a longer context length of up to 8,192 tokens vs the 512 tokens supported by standard BERT models.
Usage (txtai)
This model can be used to build embeddings databases with txtai for semantic search and/or as a knowledge source for retrieval-augmented generation (RAG).
import txtai

# Create an embeddings database; content=True stores the document text alongside the vectors
embeddings = txtai.Embeddings(path="neuml/bioclinical-modernbert-base-embeddings", content=True)

# documents() is a placeholder for an iterable of documents to index
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
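For reference, the sketch below shows one way to provide documents: it indexes a couple of inline placeholder records as (id, text, tags) tuples and runs a semantic search. The sample texts and query are illustrative only.

import txtai

embeddings = txtai.Embeddings(path="neuml/bioclinical-modernbert-base-embeddings", content=True)

# Placeholder documents as (id, text, tags) tuples
documents = [
    (0, "Metformin is a first-line therapy for type 2 diabetes", None),
    (1, "ACE inhibitors are used to treat hypertension", None),
]
embeddings.index(documents)

# Returns the best matching document and its score
print(embeddings.search("diabetes treatment", 1))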
Usage (Sentence-Transformers)
Alternatively, the model can be loaded with sentence-transformers.
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer("neuml/bioclinical-modernbert-base-embeddings")
embeddings = model.encode(sentences)
print(embeddings)
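The encoded vectors can then be compared with cosine similarity, for example with the cos_sim utility from sentence-transformers. This is a minimal sketch reusing the embeddings computed above.

from sentence_transformers import util

# Cosine similarity between the two example sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))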
Usage (Hugging Face Transformers)
The model can also be used directly with Transformers.
from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def meanpooling(output, mask):
    embeddings = output[0]  # First element of model_output contains all token embeddings
    mask = mask.unsqueeze(-1).expand(embeddings.size()).float()
    return torch.sum(embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("neuml/bioclinical-modernbert-base-embeddings")
model = AutoModel.from_pretrained("neuml/bioclinical-modernbert-base-embeddings")

# Tokenize sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    output = model(**inputs)

# Perform pooling. In this case, mean pooling.
embeddings = meanpooling(output, inputs['attention_mask'])

print("Sentence embeddings:")
print(embeddings)
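If cosine similarity scores are needed, the pooled embeddings can be L2-normalized so that a dot product equals cosine similarity. This is a small follow-on sketch, not part of the model itself.

import torch.nn.functional as F

# Normalize embeddings so that dot products equal cosine similarities
normalized = F.normalize(embeddings, p=2, dim=1)
print(normalized @ normalized.T)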
Evaluation Results
Performance of this model compared to the top base models on the MTEB leaderboard is shown below. A popular smaller model was also evaluated along with the most downloaded PubMed similarity model on the Hugging Face Hub.
The following datasets were used to evaluate model performance.
- PubMed QA
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- PubMed Subset
  - Split: test, Pair: (title, text)
- PubMed Summary
  - Subset: pubmed, Split: validation, Pair: (article, abstract)
Evaluation results are shown below. The Pearson correlation coefficient is used as the evaluation metric.
| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
| ----- | --------- | ------------- | -------------- | ------- |
| all-MiniLM-L6-v2 | 90.40 | 95.92 | 94.07 | 93.46 |
| bioclinical-modernbert-base-embeddings | 92.49 | 97.13 | 97.04 | 95.54 |
| bge-base-en-v1.5 | 91.02 | 95.82 | 94.49 | 93.78 |
| gte-base | 92.97 | 96.90 | 96.24 | 95.37 |
| pubmedbert-base-embeddings | 93.27 | 97.00 | 96.58 | 95.62 |
| S-PubMedBert-MS-MARCO | 90.86 | 93.68 | 93.54 | 92.69 |
Note that while this model scores slightly lower than the PubMedBERT embeddings model, it supports a longer context of 8,192 tokens vs 512 tokens.
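As a rough sketch of this evaluation setup (not the exact benchmark code), the example below scores text pairs with cosine similarity and computes the Pearson correlation against placeholder reference labels.

from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("neuml/bioclinical-modernbert-base-embeddings")

# Placeholder pairs and expected similarity labels standing in for one of the datasets above
texts1 = ["first text", "second text", "third text"]
texts2 = ["first paired text", "second paired text", "third paired text"]
labels = [1.0, 0.5, 0.0]

# Cosine similarity between each pair of embeddings
scores = [float(util.cos_sim(a, b)) for a, b in zip(model.encode(texts1), model.encode(texts2))]

# Pearson correlation between predicted similarities and reference labels
print(pearsonr(scores, labels))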
Training
The model was trained with the parameters:
DataLoader:
{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss:
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss
with parameters:
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
Parameters of the fit() method:
{
    "epochs": 1,
    "evaluation_steps": 2000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
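A minimal sketch of this setup with the sentence-transformers fit API is shown below. The starting checkpoint and training pairs are placeholders; the actual run used the PubMed title-abstract data described above.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Placeholder starting checkpoint (stand-in for the BioClinical ModernBERT base model)
model = SentenceTransformer("base-model-checkpoint")

# Placeholder training pairs (title, abstract)
train_examples = [
    InputExample(texts=["example title", "example abstract"]),
    InputExample(texts=["another title", "another abstract"]),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=8)
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    warmup_steps=10000,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)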
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
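When the model is loaded with sentence-transformers, this configuration can be confirmed directly, for example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("neuml/bioclinical-modernbert-base-embeddings")
print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 768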
More Information
This model was trained using the same method described in this article.
Model tree for NeuML/bioclinical-modernbert-base-embeddings
- Base model: answerdotai/ModernBERT-base