ModChemBERT: ModernBERT as a Chemical Language Model

ModChemBERT-IR-BASE is a ModernBERT-based chemical language model (CLM) pretrained on SMILES strings with masked language modeling (MLM). It serves as a base checkpoint for training embedding, retrieval, and reranking models for molecular information retrieval tasks.

Usage

Install the transformers library (v4.56.1 or later):

pip install -U "transformers>=4.56.1"

Load Model

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Derify/ModChemBERT-IR-BASE"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    device_map="auto",
)
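
Because the checkpoint is intended as a base for embedding models, the snippet below sketches one way to pull fixed-size molecule embeddings from the encoder's hidden states. The mask-weighted mean over the last hidden layer is an illustrative choice (see the Pooling section for the strategies supported downstream), and the example SMILES are arbitrary.

import torch

smiles = ["c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
inputs = tokenizer(smiles, padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last hidden layer over non-padding tokens.
hidden = outputs.hidden_states[-1]                       # (batch, seq_len, hidden)
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 1024])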

Fill-Mask Pipeline

from transformers import pipeline

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("c1ccccc1[MASK]"))

Architecture

  • Backbone: ModernBERT [1]
  • Hidden size: 1024
  • Intermediate size: 1536
  • Encoder Layers: 22
  • Attention heads: 16
  • Max sequence length: 512 tokens
  • Tokenizer: BPE tokenizer using MolFormer's vocab (2362 tokens)
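
The figures above can be cross-checked against the checkpoint's configuration. A minimal sketch, assuming the custom config keeps ModernBERT's standard field names (hidden_size, intermediate_size, num_hidden_layers, num_attention_heads):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Derify/ModChemBERT-IR-BASE", trust_remote_code=True)
print(config.hidden_size, config.intermediate_size,
      config.num_hidden_layers, config.num_attention_heads)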

Dataset

The model was pretrained with MLM on SMILES strings drawn from PubChem; see the Limitations section for notes on chemical-space coverage.

Pooling (Classifier / Regressor Head)

Kallergis et al. [2] demonstrated that, among the evaluated hyperparameters, the choice of CLM embedding (pooling) method feeding the prediction head was the strongest contributor to downstream performance.

Behrendt et al. [3] noted that the last few layers contain task-specific information and that pooling methods leveraging information from multiple layers can enhance model performance. Their results further demonstrated that the max_seq_mha pooling method was particularly effective in low-data regimes.

This base model includes configurable pooling strategies for downstream fine-tuning. When fine-tuned for embedding, retrieval, or reranking tasks (e.g., with Sentence Transformers), various pooling methods can be explored:

  • cls: Last layer [CLS]
  • mean: Mean over last hidden layer
  • max_cls: Max over last k layers of [CLS]
  • cls_mha: MHA with [CLS] as query
  • max_seq_mha: MHA with max pooled sequence as KV and max pooled [CLS] as query
  • mean_seq_mha: MHA with mean pooled sequence as KV and mean pooled [CLS] as query
  • sum_mean: Sum over all layers then mean tokens
  • sum_sum: Sum over all layers then sum tokens
  • mean_mean: Mean over all layers then mean tokens
  • mean_sum: Mean over all layers then sum tokens
  • max_seq_mean: Max over last k layers then mean tokens

Note: ModChemBERT's cls_mha, max_seq_mha, and mean_seq_mha differ from MaxPoolBERT [3]. MaxPoolBERT uses PyTorch nn.MultiheadAttention, whereas ModChemBERT's ModChemBertPoolingAttention adapts ModernBERT's ModernBertAttention. On ChemBERTa-3 benchmarks this variant produced stronger validation metrics and avoided the training instabilities (sporadic zero / NaN losses and gradient norms) seen with nn.MultiheadAttention. Training instability with ModernBERT has been reported in the past (discussion 1 and discussion 2).
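
For downstream heads, the pooling strategy is expected to be selectable at load time. The sketch below assumes it is exposed through a classifier_pooling config field, mirroring ModernBERT; check this checkpoint's config.json for the exact key and accepted values before relying on it.

from transformers import AutoModelForSequenceClassification

# Hypothetical fine-tuning setup: request the max_seq_mha pooling strategy for
# the classification head. Verify the field name against the checkpoint config.
clf = AutoModelForSequenceClassification.from_pretrained(
    "Derify/ModChemBERT-IR-BASE",
    trust_remote_code=True,
    num_labels=2,
    classifier_pooling="max_seq_mha",
)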

Intended Use

  • Primary: Base model for training embedding, retrieval, and reranking models for chemical information retrieval tasks using frameworks such as Sentence Transformers (a construction sketch follows this list).
  • Appropriate for: Fine-tuning for semantic search of chemical compounds, molecular similarity tasks, chemical information retrieval systems, and as a foundation for building chemical embedding models.
  • Not intended for: Direct molecular property prediction without fine-tuning, generating novel molecules, or production use without domain-specific validation.
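
As a sketch of the primary use case, the snippet below wraps the checkpoint in a Sentence Transformers model ready for fine-tuning. The mean pooling module is an illustrative assumption rather than a prescribed recipe, and a recent sentence-transformers release (with config_args and trust_remote_code support) is assumed.

from sentence_transformers import SentenceTransformer, models

# Hypothetical embedding-model construction for fine-tuning on retrieval data.
word = models.Transformer(
    "Derify/ModChemBERT-IR-BASE",
    max_seq_length=512,
    model_args={"trust_remote_code": True},
    tokenizer_args={"trust_remote_code": True},
    config_args={"trust_remote_code": True},
)
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
st_model = SentenceTransformer(modules=[word, pooling])
print(st_model.encode(["c1ccccc1O"]).shape)  # (1, 1024)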

Limitations

  • This is a base model pretrained only on masked language modeling; it requires fine-tuning for specific information retrieval tasks.
  • Performance on out-of-domain chemical spaces may vary: very long SMILES (>512 tokens), inorganic/organometallic compounds, polymers, or charged/enumerated tautomers may not be well represented in the training corpus.
  • The model reflects the chemical space distribution of PubChem and may not generalize equally well to all chemical domains.

Ethical Considerations & Responsible Use

  • This base model is intended for research and development purposes in chemical information retrieval.
  • When fine-tuned for downstream applications, users should validate performance on their specific domain and use case.
  • Do not deploy in clinical, regulatory, or safety-critical settings without rigorous domain-specific validation and appropriate oversight.

Hardware

Training was performed on two NVIDIA RTX 3090 GPUs using Hugging Face Accelerate for distributed data parallel (DDP) training.

Citation

If you use ModChemBERT-IR-BASE in your research, please cite the checkpoint and the following:

@software{cortes-2025-modchembert,
  author = {Emmanuel Cortes},
  title = {ModChemBERT: ModernBERT as a Chemical Language Model},
  year = {2025},
  publisher = {GitHub},
  howpublished = {GitHub repository},
  url = {https://github.com/emapco/ModChemBERT}
}

References

  1. Warner, Benjamin, et al. "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference." arXiv preprint arXiv:2412.13663 (2024).
  2. Kallergis, G., Asgari, E., Empting, M., et al. "Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa." Communications Chemistry 8, 114 (2025). https://doi.org/10.1038/s42004-025-01484-4
  3. Behrendt, Maike, Stefan Sylvius Wagner, and Stefan Harmeling. "MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation." arXiv preprint arXiv:2505.15696 (2025).