BERT base for SMILES

This is a bidirectional transformer pretrained on SMILES (simplified molecular-input line-entry system) strings.

Example: Amoxicillin

O=C([C@@H](c1ccc(cc1)O)N)N[C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C

Two training objectives were used:

masked language modeling
molecular-formula validity prediction
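
The first objective can be illustrated with a minimal masking sketch. It assumes a regex-based SMILES tokenizer and a 15% masking rate, both taken from common practice rather than from this model card, and it omits the 80/10/10 replacement scheme of the original BERT recipe. The model's actual tokenizer and masking procedure may differ; `tokenize_smiles` and `mask_tokens` are hypothetical helper names.

```python
import random
import re

# Assumption: split into bracket atoms, two-letter halogens, two-digit
# ring-closure labels, and single characters. Not the model's real tokenizer.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens (illustrative only)."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with mask_token.

    Returns (masked_tokens, labels): labels holds the original token at
    masked positions and None elsewhere.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


amoxicillin = "O=C([C@@H](c1ccc(cc1)O)N)N[C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C"
tokens = tokenize_smiles(amoxicillin)
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
```

During pretraining, the model would be asked to recover the entries of `labels` from the masked sequence.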

Intended uses

This model is primarily intended to be fine-tuned for the following tasks:

molecule classification
molecule-to-gene-expression mapping
cell targeting
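
For the first task, a fine-tuning setup might look like the sketch below: a linear classification head over the first-token vector of the encoder. This is one common pattern, not a setup documented by this model card. `MoleculeClassifier` is a hypothetical name, and the code assumes the tokenizer places a `[CLS]`-style token at position 0, as BERT tokenizers usually do.

```python
import torch
from torch import nn


class MoleculeClassifier(nn.Module):
    """Linear head over the first-token vector of an encoder.

    Assumption: `encoder` accepts tokenizer outputs as keyword arguments and
    returns an object with a `last_hidden_state` attribute, as BertModel does.
    """

    def __init__(self, encoder, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, **inputs):
        # (batch, seq_len, hidden) -> take position 0 -> (batch, hidden)
        hidden = self.encoder(**inputs).last_hidden_state
        first_token = hidden[:, 0]
        # Unnormalized class scores, shape (batch, num_labels)
        return self.head(self.dropout(first_token))
```

With a loaded `BertModel` instance named `model`, the head could be built as `MoleculeClassifier(model, model.config.hidden_size, num_labels=2)` and trained with `nn.CrossEntropyLoss` on the returned logits.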

How to use in your code

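from transformers import BertTokenizerFast, BertModel

# Load the tokenizer and the pretrained encoder from the Hugging Face Hub
checkpoint = 'unikei/bert-base-smiles'
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertModel.from_pretrained(checkpoint)

# Tokenize a SMILES string (amoxicillin) and run it through the encoder
example = 'O=C([C@@H](c1ccc(cc1)O)N)N[C@@H]1C(=O)N2[C@@H]1SC([C@@H]2C(=O)O)(C)C'
tokens = tokenizer(example, return_tensors='pt')
predictions = model(**tokens)  # predictions.last_hidden_state holds the per-token embeddings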
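
The encoder returns one vector per token. To obtain a single fixed-size vector per molecule, one option is to average the token vectors while ignoring padding. The sketch below is an assumption about how the output might be pooled, not a method prescribed by this model card; `mean_pool` and `embed_smiles` are hypothetical helpers.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


def embed_smiles(smiles_list, tokenizer, model):
    """Return one pooled embedding per SMILES string."""
    batch = tokenizer(smiles_list, padding=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

With the `tokenizer` and `model` loaded above, `embed_smiles([example], tokenizer, model)` should return a tensor of shape `(1, hidden_size)`.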

Research

Jouary et al. (2025) Bridging scales between chemical space and behavioral phenotype:

A cross-modal mapping between behavior and molecular structure, derived using the unikei/bert-base-smiles model, effectively distinguished between distinct neurotransmitter classes, such as dopaminergic/serotonergic ligands, purines, and metabotropic glutamate ligands.

Model size: 110M params (Safetensors, tensor type F32)
