Model Card for BigBirdPegasus_Chemtagger

This model is part of this publication. It is used for translating chemical synthesis procedures given in natural language (English) into "action graphs", i.e., a simple markup language listing synthesis actions from a pre-defined controlled vocabulary along with their process parameters.
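
For illustration only: the controlled vocabulary and exact markup are defined in the accompanying publication and repository, and the action and parameter names below are hypothetical. A sentence such as "The mixture was stirred at 80 °C for 2 h and then filtered." might be mapped to an action graph along the lines of:

STIR(temperature='80 °C', duration='2 h'); FILTER()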

Model Details

Model Description

The model was fine-tuned on a dataset containing chemical synthesis procedures from the patent literature as input, and automatically generated annotations (action graphs) as output. The annotations were created using ChemicalTagger and rule-based post-processing of the ChemicalTagger output.

Model Sources

  • Repository: The repository accompanying this model can be found here
  • Paper: The papers accompanying this model can be found here and here

Uses

The model is integrated into a node editor app for generating workflows from synthesis procedures given in natural language for the Self-Driving Lab platform Minerva.

Direct Use

Although this is not the intended way of using the model, it can be used "stand-alone" to create action graphs from chemical synthesis procedures given in natural language (see below for a usage example).

Downstream Use

The model is intended to be used with the node editor app for the Self-Driving Lab platform Minerva.

Out-of-Scope Use

The model works best on synthesis procedures written in a style similar to that of synthesis procedures in patents and the experimental sections of scientific journals from the general fields of chemistry (organic, inorganic, materials science).

Bias, Risks, and Limitations

The model might produce inaccurate results for procedures from other fields, or for procedures that cross-reference other procedures, generic recipes, etc.

Recommendations

Users (both direct and downstream) should always check the feasibility of the produced output before further processing it and running a chemical reaction based on the output.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import re


def preprocess(rawtext: str) -> str:
    # Normalize spacing around brackets and punctuation, and collapse line breaks and tabs.
    rawtext = rawtext.replace('( ', '(').replace(' )', ')').replace('[ ', '[').replace(' ]', ']')
    rawtext = rawtext.replace(' . ', '. ').replace(' , ', ', ').replace(' : ', ': ').replace(' ; ', '; ')
    rawtext = rawtext.replace('\r', ' ').replace('\n', ' ').replace('\t', '').replace('  ', ' ')
    # Normalize special characters: Greek mu and the micro sign become 'u', the multiplication sign becomes 'x'.
    rawtext = rawtext.replace('μ', 'u').replace('µ', 'u').replace('× ', 'x').replace('×', 'x')
    # Remove the remaining space in patterns like "3x 5" -> "3x5".
    for m in re.finditer(r'[0-9]x\s[0-9]', rawtext):
        rawtext = rawtext.replace(m.group(), m.group().replace(' ', ''))
    return rawtext


if __name__ == '__main__':
    rawtext = """<Insert your Synthesis Procedure here>"""

    model_id = 'bruehle/BigBirdPegasus_Chemtagger'  # or use any of the other models
    # model_id = 'bruehle/BigBirdPegasus_Llama'
    # model_id = 'bruehle/LED-Base-16384_Llama'
    # model_id = 'bruehle/LED-Base-16384_Chemtagger'
    
    if 'BigBirdPegasus' in model_id:
        max_length = 512
    elif 'LED-Base-16384' in model_id:
        max_length = 1024
    
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map='auto')
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    pipe = pipeline('text2text-generation', model=model, tokenizer=tokenizer)

    print(pipe(preprocess(rawtext), max_new_tokens=max_length, do_sample=False, temperature=None, top_p=None)[0]['generated_text'])
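
The same generation can also be run without the pipeline helper by calling model.generate directly, e.g. when more control over tokenization or generation arguments is needed. The following lines are only an illustrative continuation of the script above (they reuse rawtext, model, tokenizer, and max_length) and are not part of the original example:

    # Alternative to the pipeline call: tokenize, generate, and decode manually.
    inputs = tokenizer(preprocess(rawtext), return_tensors='pt', truncation=True).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_length, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))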

Training Details

Training Data

Models were trained on A100-80GB GPUs for 885,225 steps (5 epochs) on the training split, using a batch size of 8, an initial learning rate of 5×10⁻⁵ with a warmup ratio of 0.05, and a cosine learning-rate decay schedule. All other hyperparameters were left at their default values.
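
For orientation, the hyperparameters described above roughly correspond to the following Hugging Face Seq2SeqTrainingArguments. This is an illustrative reconstruction from the description, not the original training script; output_dir is a placeholder.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='./checkpoints',        # placeholder path
    num_train_epochs=5,                # 5 epochs over the training split
    per_device_train_batch_size=8,     # batch size of 8
    learning_rate=5e-5,                # initial learning rate
    warmup_ratio=0.05,                 # warmup ratio of 0.05
    lr_scheduler_type='cosine',        # cosine learning-rate decay
    fp16=False,                        # training regime: fp32
    bf16=False,
)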

Training Procedure

Preprocessing

More information on data pre- and postprocessing can be found here.

Training Hyperparameters

  • Training regime: fp32

Evaluation

Testing Data, Factors & Metrics

Testing Data

Example outputs for experimental procedures from materials science, organic chemistry, and inorganic chemistry, as well as from a patent, none of which were part of the training or evaluation datasets, can be found here.

Technical Specifications

Model Architecture and Objective

BigBirdPegasus-Large Model for Text2Text/Seq2Seq Generation.
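
The configuration of the checkpoint can be inspected with the standard Hugging Face API; the short sketch below only reads values from the published config (attribute names follow BigBirdPegasusConfig) and makes no further assumptions.

from transformers import AutoConfig

# Load the checkpoint's configuration and print a few architecture details.
config = AutoConfig.from_pretrained('bruehle/BigBirdPegasus_Chemtagger')
print(config.model_type)               # 'bigbird_pegasus'
print(config.max_position_embeddings)  # maximum input length in tokens
print(config.attention_type)           # 'block_sparse' or 'original_full'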

Compute Infrastructure

Trained on HPC GPU nodes of the Federal Institute for Materials Research and Testing (BAM).

Hardware

NVIDIA A100 (80 GB) GPU, Intel(R) Xeon(R) Gold 6342 CPU @ 2.80 GHz

Software

Python 3.12

Citation

BibTeX:

@article{Ruehle_2025, title={Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs}, DOI={10.26434/chemrxiv-2025-0p7xx}, journal={ChemRxiv}, author={Ruehle, Bastian}, year={2025}}

@article{doi:10.1021/acsnano.4c17504, author = {Zaki, Mohammad and Prinz, Carsten and Ruehle, Bastian}, title = {A Self-Driving Lab for Nano- and Advanced Materials Synthesis}, journal = {ACS Nano}, volume = {19}, number = {9}, pages = {9029-9041}, year = {2025}, doi = {10.1021/acsnano.4c17504}, note ={PMID: 39995288}, URL = {https://doi.org/10.1021/acsnano.4c17504}, eprint = {https://doi.org/10.1021/acsnano.4c17504}}

APA:

Ruehle, B. (2025). Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs. ChemRxiv. doi:10.26434/chemrxiv-2025-0p7xx

Zaki, M., Prinz, C. & Ruehle, B. (2025). A Self-Driving Lab for Nano- and Advanced Materials Synthesis. ACS Nano, 19(9), 9029-9041. doi:10.1021/acsnano.4c17504

Model Card Authors

Bastian Ruehle

Model Card Contact

[email protected]
