---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---

# NovoMolGen

NovoMolGen is a family of molecular foundation models trained on 1.5 billion molecules from ZINC-22 using Llama architectures and FlashAttention. It achieves state-of-the-art performance on both unconstrained and goal-directed molecule generation tasks.

## How to load

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
```

## Quickstart

```python
outputs = model.sample(tokenizer=tokenizer, batch_size=4)
print(outputs['SMILES'])
```

## Citation

```bibtex
@article{chitsaz2025novomolgen,
  title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
  author={Chitsaz, Kamran and Balaji, Roshan and Fournier, Quentin and Bhatt, Nirav Pravinbhai and Chandar, Sarath},
  journal={arXiv preprint},
  year={2025},
}
```
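
## Example: filtering generated SMILES

A minimal sketch of post-processing the Quickstart output, assuming `model` and `tokenizer` are loaded as shown above, that `model.sample` returns a dict with a `'SMILES'` list of strings, and that RDKit is installed (it is not a dependency of this model card); the batch size of 64 is an arbitrary choice for illustration.

```python
from rdkit import Chem

# Sample a batch of molecules (model and tokenizer loaded as in "How to load" above).
outputs = model.sample(tokenizer=tokenizer, batch_size=64)
smiles = outputs['SMILES']

# Not every generated string is guaranteed to parse; keep only those RDKit accepts.
valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]

print(f"{len(valid)}/{len(smiles)} valid, {len(set(valid))} unique")
```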