ModChemBERT: ModernBERT as a Chemical Language Model
ModChemBERT-IR-BASE is a ModernBERT-based chemical language model (CLM) pretrained on SMILES strings using masked language modeling (MLM). This model serves as a base model for training embedding, retrieval, and reranking models for molecular information retrieval tasks.
Usage
Install the transformers library (version 4.56.1 or later):
pip install -U "transformers>=4.56.1"
Load Model
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_id = "Derify/ModChemBERT-IR-BASE"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
model_id,
trust_remote_code=True,
dtype="bfloat16",
device_map="auto",
)
Fill-Mask Pipeline
from transformers import pipeline
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("c1ccccc1[MASK]"))
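Since this checkpoint is intended as a base for embedding models, raw sentence-level embeddings can also be sanity-checked with simple mean pooling over the last hidden state. The following is a minimal sketch, not a tuned recipe; it assumes the checkpoint exposes hidden states like a standard ModernBERT model and that the tokenizer defines a padding token:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Derify/ModChemBERT-IR-BASE"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

smiles = ["c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
batch = tokenizer(smiles, padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)
hidden = out.hidden_states[-1]                     # (batch, seq_len, hidden)
mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding positions
embeddings = (hidden * mask).sum(1) / mask.sum(1)  # masked mean pooling
print(embeddings.shape)                            # expected: torch.Size([2, 1024])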
Architecture
- Backbone: ModernBERT [1]
- Hidden size: 1024
- Intermediate size: 1536
- Encoder Layers: 22
- Attention heads: 16
- Max sequence length: 512 tokens
- Tokenizer: BPE tokenizer using MolFormer's vocab (2362 tokens)
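These values can be verified from the checkpoint's configuration. The sketch below assumes the config inherits the standard ModernBERT attribute names:
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Derify/ModChemBERT-IR-BASE", trust_remote_code=True)
# Standard ModernBERT config attributes (assumed to carry over to this checkpoint)
print(config.hidden_size, config.intermediate_size)           # 1024 1536
print(config.num_hidden_layers, config.num_attention_heads)   # 22 16
print(config.vocab_size)                                      # 2362 (per the tokenizer vocab above)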
Dataset
- Pretraining: PubChem 110M dataset (canonical SMILES strings)
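Because pretraining used canonical SMILES, it can help to canonicalize inputs before encoding so that queries match the pretraining distribution. A small preprocessing sketch, assuming RDKit is installed:
from rdkit import Chem

def canonicalize(smiles: str) -> str | None:
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

print(canonicalize("OC1=CC=CC=C1"))  # phenol -> "Oc1ccccc1"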
Pooling (Classifier / Regressor Head)
Kallergis et al. [2] demonstrated that the CLM embedding method applied before the prediction head was the strongest contributor to downstream performance among the evaluated hyperparameters.
Behrendt et al. [3] noted that the last few layers contain task-specific information and that pooling methods leveraging information from multiple layers can enhance model performance. Their results further demonstrated that the max_seq_mha pooling method was particularly effective in low-data regimes.
This base model includes configurable pooling strategies for downstream fine-tuning. When fine-tuned for embedding, retrieval, or reranking tasks (e.g., with Sentence Transformers), various pooling methods can be explored:
- cls: Last layer [CLS]
- mean: Mean over last hidden layer
- max_cls: Max over last k layers of [CLS]
- cls_mha: MHA with [CLS] as query
- max_seq_mha: MHA with max pooled sequence as KV and max pooled [CLS] as query
- mean_seq_mha: MHA with mean pooled sequence as KV and mean pooled [CLS] as query
- sum_mean: Sum over all layers then mean tokens
- sum_sum: Sum over all layers then sum tokens
- mean_mean: Mean over all layers then mean tokens
- mean_sum: Mean over all layers then sum tokens
- max_seq_mean: Max over last k layers then mean tokens
Note: ModChemBERT's cls_mha, max_seq_mha, and mean_seq_mha differ from MaxPoolBERT [3]. MaxPoolBERT uses PyTorch nn.MultiheadAttention, whereas ModChemBERT's ModChemBertPoolingAttention adapts ModernBERT's ModernBertAttention.
On ChemBERTa-3 benchmarks this variant produced stronger validation metrics and avoided the training instabilities (sporadic zero / NaN losses and gradient norms) seen with nn.MultiheadAttention. Training instability with ModernBERT has been reported in the past (discussion 1 and discussion 2).
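As a starting point for such fine-tuning, the base encoder can be wrapped in a Sentence Transformers model. The sketch below uses the library's built-in mean pooling over the last hidden layer; reproducing the ModChemBERT-specific pooling modes listed above would require a custom pooling module.
from sentence_transformers import SentenceTransformer, models

word = models.Transformer(
    "Derify/ModChemBERT-IR-BASE",
    max_seq_length=512,
    model_args={"trust_remote_code": True},
    tokenizer_args={"trust_remote_code": True},
    config_args={"trust_remote_code": True},
)
# Built-in mean pooling over the last hidden layer (not the custom heads above)
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
st_model = SentenceTransformer(modules=[word, pooling])

embeddings = st_model.encode(["c1ccccc1O", "CCO"])
print(embeddings.shape)  # (2, 1024)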
Intended Use
- Primary: Base model for training embedding, retrieval, and reranking models for chemical information retrieval tasks using frameworks such as Sentence Transformers.
- Appropriate for: Fine-tuning for semantic search of chemical compounds, molecular similarity tasks, chemical information retrieval systems, and as a foundation for building chemical embedding models.
- Not intended for: Direct molecular property prediction without fine-tuning, generating novel molecules, or production use without domain-specific validation.
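For illustration of the retrieval use case above, a fine-tuned checkpoint could be used for semantic search over a SMILES corpus roughly as follows. The model name is a hypothetical placeholder for your own fine-tuned model, not a published checkpoint:
from sentence_transformers import SentenceTransformer, util

# "your-finetuned-modchembert-ir" is a placeholder for a checkpoint fine-tuned from this base model
st_model = SentenceTransformer("your-finetuned-modchembert-ir", trust_remote_code=True)

corpus = ["CCO", "CC(=O)O", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
query = "c1ccccc1O"

corpus_emb = st_model.encode(corpus, convert_to_tensor=True)
query_emb = st_model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])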
Limitations
- This is a base model pretrained only on masked language modeling; it requires fine-tuning for specific information retrieval tasks.
- Performance on out-of-domain chemical spaces may vary: very long SMILES (>512 tokens), inorganic/organometallic compounds, polymers, or charged/enumerated tautomers may not be well represented in the training corpus.
- The model reflects the chemical space distribution of PubChem and may not generalize equally well to all chemical domains.
Ethical Considerations & Responsible Use
- This base model is intended for research and development purposes in chemical information retrieval.
- When fine-tuned for downstream applications, users should validate performance on their specific domain and use case.
- Do not deploy in clinical, regulatory, or safety-critical settings without rigorous domain-specific validation and appropriate oversight.
Hardware
Training was performed on two NVIDIA RTX 3090 GPUs using Hugging Face Accelerate for distributed data parallel (DDP) training.
Citation
If you use ModChemBERT-IR-BASE in your research, please cite the checkpoint and the following:
@software{cortes-2025-modchembert,
author = {Emmanuel Cortes},
title = {ModChemBERT: ModernBERT as a Chemical Language Model},
year = {2025},
publisher = {GitHub},
howpublished = {GitHub repository},
url = {https://github.com/emapco/ModChemBERT}
}
References
- [1] Warner, Benjamin, et al. "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference." arXiv preprint arXiv:2412.13663 (2024).
- [2] Kallergis, G., Asgari, E., Empting, M. et al. "Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa." Commun Chem 8, 114 (2025). https://doi.org/10.1038/s42004-025-01484-4
- [3] Behrendt, Maike, Stefan Sylvius Wagner, and Stefan Harmeling. "MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation." arXiv preprint arXiv:2505.15696 (2025).