arxiv:2506.10896

BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

Published on Jun 12, 2025
Abstract

AI-generated summary: BioClinical ModernBERT, an encoder-based transformer model, enhances biomedical and clinical NLP through continued pretraining, long-context processing, and improvements in speed and performance across diverse datasets and tasks.

Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.
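
As a small illustration of how such an encoder is typically used for the discriminative tasks mentioned above, here is a minimal sketch of loading a BioClinical ModernBERT checkpoint with Hugging Face transformers for token classification (e.g. clinical NER). The Hub id `thomas-sounack/BioClinical-ModernBERT-base` and the label set are assumptions for this example, not details taken from this page.

```python
# Minimal sketch: BioClinical ModernBERT as a long-context encoder for a
# discriminative task (token classification / NER).
# Assumptions: the Hub id and the label schema below are illustrative only.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "thomas-sounack/BioClinical-ModernBERT-base"  # assumed Hub id
labels = ["O", "B-PROBLEM", "I-PROBLEM", "B-TREATMENT", "I-TREATMENT"]  # example schema

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# ModernBERT-style encoders accept long inputs, so full clinical notes can be
# encoded without aggressive chunking.
note = "Patient was started on metformin for type 2 diabetes mellitus."
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=8192)
logits = model(**inputs).logits  # shape: (1, seq_len, num_labels); head is untrained here
```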

Community

Very interesting paper and downstream task results, particularly for NER!

@thomas-sounack did you modify some pretraining parameters compared to the original ModernBERT, by the way? I am thinking of the RoPE theta, for example.


Hi @stefan-it , thanks for the feedback!
We did not modify RoPE theta. Overall, our training hyperparameters were very similar to the original ModernBERT's; the only change was lowering the masking ratio during the decay phase (referred to as Phase 2 in our paper).
This is due to the nature of ModernBERT's WSD schedule: you can take any checkpoint and continue training on your own data without cold-restart issues, but the training hyperparameters should stay similar.
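
Below is a minimal sketch of what this kind of continued masked-language-model pretraining could look like with the Hugging Face Trainer, assuming a ModernBERT checkpoint and a lowered masking probability for the decay phase. This is not the authors' training code; the dataset path, learning rate, and mlm_probability value are illustrative placeholders.

```python
# Minimal sketch (not the authors' training code): continued MLM pretraining
# from a ModernBERT checkpoint with a lowered masking probability, as one would
# do for the decay phase of a WSD schedule.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "answerdotai/ModernBERT-base"  # or an intermediate WSD checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Placeholder corpus; replace with your own clinical notes or abstracts.
raw = load_dataset("text", data_files={"train": "clinical_notes.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

# Lowered masking ratio for the decay phase (the value here is illustrative,
# not the one used in the paper).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bioclinical-continued-pretraining",
        per_device_train_batch_size=8,
        learning_rate=3e-4,          # placeholder; keep close to the checkpoint's schedule
        lr_scheduler_type="linear",  # decay toward zero from the resumed checkpoint
        num_train_epochs=1,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```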


