Tags: Fill-Mask · Transformers · Safetensors · modernbert · smiles · chemistry · BERT · molecules

MolEncoder

MolEncoder is a BERT-based chemical language model pretrained on SMILES strings using masked language modeling (MLM). It was designed to investigate optimal pretraining strategies for molecular representation learning, with a particular focus on masking ratio, dataset size, and model size. It is described in detail in the paper "MolEncoder: Towards Optimal Masked Language Modeling for Molecules".

Model Description

  • Architecture: Encoder-only transformer based on ModernBERT
  • Parameters: ~15M
  • Tokenizer: Character-level tokenizer covering the full SMILES vocabulary
  • Pretraining Objective: Masked language modeling with an optimized masking ratio (30% was found to work best for molecules)
  • Pretraining Data: ~1M molecules (half of ChEMBL)
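
The snippet below is a minimal loading sketch, assuming a recent transformers release with ModernBERT support (the concrete tokenizer and model classes are resolved from the hub config); the example SMILES string and the parameter-count check are illustrative only.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "fabikru/MolEncoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Character-level tokenization of an example SMILES string (aspirin).
encoded = tokenizer("CC(=O)Oc1ccccc1C(=O)O")
print(encoded["input_ids"])

# Rough parameter count; should be on the order of 15M.
print(sum(p.numel() for p in model.parameters()))
```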

Key Findings

  • Higher masking ratios (20–60%) outperform the standard 15% used in prior molecular BERT models (see the collator sketch below).
  • Increasing model size or dataset size beyond moderate scales yields no consistent performance benefit and can degrade efficiency.
  • This 15M-parameter model, pretrained on ~1M molecules, outperforms much larger models pretrained on more SMILES strings.
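
As an illustration of the masking-ratio finding above, a higher ratio can be set in the generic transformers MLM collator. This is a sketch only, not necessarily the authors' exact pretraining pipeline (see the GitHub repository for that).

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("fabikru/MolEncoder")

# Mask 30% of tokens instead of the conventional 15%.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)
```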

Intended Uses

  • Primary use: Molecular property prediction through fine-tuning on downstream datasets
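
As a rough sketch of that use case, the encoder can be wrapped with a standard sequence-classification head. The num_labels value and regression setting below are hypothetical placeholders, and dataset loading and the training loop are omitted; the GitHub repository provides ready-to-use fine-tuning code.

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical setup for a single regression target (e.g. a solubility value).
model = AutoModelForSequenceClassification.from_pretrained(
    "fabikru/MolEncoder",
    num_labels=1,
    problem_type="regression",
)
```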

How to Use

Please refer to the MolEncoder GitHub repository for detailed instructions and ready-to-use examples of fine-tuning the model on custom data and running predictions.
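
For a quick sanity check outside the repository, the standard fill-mask pipeline can be used directly. This assumes the tokenizer defines a mask token and is only an illustrative example, not the recommended workflow.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="fabikru/MolEncoder")

# Mask the final atom of aspirin's SMILES and ask the model to recover it.
masked = "CC(=O)Oc1ccccc1C(=O)" + fill.tokenizer.mask_token
for pred in fill(masked, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```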

Citation

If you use this model, please cite the MolEncoder paper (citation will be inserted soon).
