Model Card for LED-Base-16384_Chemtagger
This model is part of the publication referenced below. It translates chemical synthesis procedures given in natural language (English) into "action graphs", i.e., a simple markup language that lists synthesis actions from a pre-defined controlled vocabulary together with their process parameters.
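To illustrate the idea (the sentence and the exact tag syntax below are made up for illustration and are not taken from the model's controlled vocabulary), a procedure sentence such as "The mixture was stirred at 80 °C for 2 h and then filtered." would be converted into a short sequence of tagged actions along the lines of STIR (temperature: 80 °C, duration: 2 h) followed by FILTER.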
Model Details
Model Description
The model was fine-tuned on a dataset containing chemical synthesis procedures from the patent literature as input, and automatically generated annotations (action graphs) as output. The annotations were created using ChemicalTagger and rule-based post-processing of the ChemicalTagger output.
- Developed by: Bastian Ruehle
- Funded by: Federal Institute for Materials Research and Testing (BAM)
- Model type: LED (Longformer Encoder-Decoder)
- Language(s) (NLP): en
- License: MIT
- Finetuned from model: allenai/led-base-16384
Model Sources
- Repository: The repository accompanying this model can be found here
- Paper: The papers accompanying this model can be found here and here
Uses
The model is integrated into a node editor app that generates workflows for the Self-Driving Lab platform Minerva from synthesis procedures given in natural language.
Direct Use
Although this is not the intended way of using the model, it can also be used "stand-alone" for creating action graphs from chemical synthesis procedures given in natural language (see below for a usage example).
Downstream Use
The model was intended to be used with the node editor app for the Self-Driving Lab platform Minerva.
Out-of-Scope Use
The model works best on synthesis procedures written in a style similar to that of patents and of the experimental sections of scientific journals from the general fields of chemistry (organic, inorganic, materials science).
Bias, Risks, and Limitations
The model might produce inaccurate results for procedures from other fields, or for procedures that cross-reference other procedures, generic recipes, etc.
Recommendations
Users (both direct and downstream) should always check the feasibility of the generated output before processing it further and before running a chemical reaction based on it.
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import re


def preprocess(rawtext: str) -> str:
    # Normalize spacing around brackets and punctuation and collapse line breaks, tabs, and double spaces
    rawtext = rawtext.replace('( ', '(').replace(' )', ')').replace('[ ', '[').replace(' ]', ']').replace(' . ', '. ').replace(' , ', ', ').replace(' : ', ': ').replace(' ; ', '; ').replace('\r', ' ').replace('\n', ' ').replace('\t', '').replace('  ', ' ')
    # Convert the Greek mu, the micro sign, and the multiplication sign to ASCII equivalents
    rawtext = rawtext.replace('μ', 'u').replace('µ', 'u').replace('× ', 'x').replace('×', 'x')
    # Remove the stray space in expressions like "3x 5" so they become "3x5"
    for m in re.finditer(r'[0-9]x\s[0-9]', rawtext):
        rawtext = rawtext.replace(m.group(), m.group().replace(' ', ''))
    return rawtext


if __name__ == '__main__':
    rawtext = """<Insert your Synthesis Procedure here>"""
    # model_id = 'bruehle/BigBirdPegasus_Llama'
    # model_id = 'bruehle/LED-Base-16384_Llama'
    # model_id = 'bruehle/BigBirdPegasus_Chemtagger'
    model_id = 'bruehle/LED-Base-16384_Chemtagger'  # or use any of the other models
    # The maximum output length depends on the model architecture
    if 'BigBirdPegasus' in model_id:
        max_length = 512
    elif 'LED-Base-16384' in model_id:
        max_length = 1024
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map='auto')
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    pipe = pipeline('text2text-generation', model=model, tokenizer=tokenizer)
    # Greedy decoding for reproducible action graphs
    print(pipe(preprocess(rawtext), max_new_tokens=max_length, do_sample=False, temperature=None, top_p=None)[0]['generated_text'])
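In the example above, greedy decoding is used (do_sample=False, with temperature and top_p unset) so that the same procedure text always yields the same action graph; max_new_tokens follows the per-model limits from the selection logic above (512 for the BigBirdPegasus models, 1024 for the LED models).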
Training Details
Training Data
Models were trained on A100-80GB GPUs for 885,225 steps (5 epochs) on the training split, using a batch size of 8, an initial learning rate of 5×10⁻⁵ with a 0.05 warmup ratio, and a cosine weight decay. All other hyperparameters used the default values.
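For orientation, these settings roughly correspond to the following transformers Seq2SeqTrainingArguments. This is a minimal sketch assuming the Hugging Face Trainer was used; the output directory is a placeholder, the cosine schedule is an assumed reading of "cosine weight decay", and all other arguments keep their library defaults.

from transformers import Seq2SeqTrainingArguments

# Minimal sketch of the reported hyperparameters (assumption: Hugging Face Trainer, defaults otherwise)
training_args = Seq2SeqTrainingArguments(
    output_dir='./LED-Base-16384_Chemtagger',  # hypothetical output path
    num_train_epochs=5,                        # 5 epochs over the training split
    per_device_train_batch_size=8,             # batch size of 8
    learning_rate=5e-5,                        # initial learning rate of 5x10^-5
    warmup_ratio=0.05,                         # warmup ratio of 0.05
    lr_scheduler_type='cosine',                # assumed: cosine learning-rate decay
    fp16=False,                                # fp32 training regime
)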
Training Procedure
Preprocessing
More information on data pre- and postprocessing can be found here.
Training Hyperparameters
- Training regime: fp32
Evaluation
Testing Data, Factors & Metrics
Testing Data
Example outputs for experimental procedures from the domains of materials science, organic chemistry, and inorganic chemistry, as well as from a patent, none of which were part of the training or evaluation dataset, can be found here.
Technical Specifications
Model Architecture and Objective
Longformer Encoder-Decoder Model for Text2Text/Seq2Seq Generation.
Compute Infrastructure
Trained on HPC GPU nodes of the Federal Institute for Materials Research and Testing (BAM).
Hardware
NVIDIA A100 (80 GB) GPU, Intel(R) Xeon(R) Gold 6342 CPU @ 2.80 GHz
Software
Python 3.12
Citation
BibTeX:
@article{Ruehle_2025, title={Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs}, DOI={10.26434/chemrxiv-2025-0p7xx}, journal={ChemRxiv}, author={Ruehle, Bastian}, year={2025}}
@article{doi:10.1021/acsnano.4c17504, author = {Zaki, Mohammad and Prinz, Carsten and Ruehle, Bastian}, title = {A Self-Driving Lab for Nano- and Advanced Materials Synthesis}, journal = {ACS Nano}, volume = {19}, number = {9}, pages = {9029-9041}, year = {2025}, doi = {10.1021/acsnano.4c17504}, note ={PMID: 39995288}, URL = {https://doi.org/10.1021/acsnano.4c17504}, eprint = {https://doi.org/10.1021/acsnano.4c17504}}
APA:
Ruehle, B. (2025). Natural Language Processing for Automated Workflow and Knowledge Graph Generation in Self-Driving Labs. ChemRxiv. doi:10.26434/chemrxiv-2025-0p7xx
Zaki, M., Prinz, C. & Ruehle, B. (2025). A Self-Driving Lab for Nano- and Advanced Materials Synthesis. ACS Nano, 19(9), 9029-9041. doi:10.1021/acsnano.4c17504
Model Card Authors
Bastian Ruehle
Model Card Contact