metadata
license: cc-by-nc-nd-4.0
extra_gated_prompt: >-
  By submitting any personal information (e.g., name, contact details), you
  agree to the collection and processing of this data for the purpose of
  evaluating access requests for this model. Repository authors will store this
  data securely and will not share it with third parties without your explicit
  consent. You retain all rights to your personal information and may request
  its deletion at any time.

  By accessing the repository, you agree not to use this model in experiments that may result in harm to human or animal subjects.
extra_gated_fields:
  Date of Agreement: date_picker
  I accept the terms of the license and I agree not to use this model for commercial purposes or profit generation: checkbox
tags:
  - molecular-generation
  - diffusion-models
  - cheminformatics
  - 3D-conformer
  - rdkit
  - non-commercial
language: en
library_name: mlconfgen
datasets:
  - ChEMBL
metrics:
  - shape-tanimoto
  - validity
  - uniqueness
  - novelty
  - Fréchet Distance
model-index:
  - name: ML Conformer Generator
    results:
      - task:
          type: molecular-generation
          name: 3D Conformer Generation
        dataset:
          name: ChEMBL (filtered)
          type: molecules
        metrics:
          - name: Valid molecules
            type: validity
            value: 48%
          - name: Chemical novelty
            type: novelty
            value: 99.84%
          - name: Shape Tanimoto Similarity (avg)
            type: shape-tanimoto
            value: 53.32%
          - name: Shape Tanimoto Similarity (max)
            type: shape-tanimoto
            value: 99.69%
          - name: Unique molecules
            type: uniqueness
            value: 99.94%
          - name: Fréchet Fingerprint Distance
            type: Fréchet Distance
            value: 4.13

ML Conformer Generator

ML Conformer Generator is a shape-constrained molecule generation model that combines an Equivariant Diffusion Model (EDM) with a Graph Convolutional Network (GCN). It generates 3D conformations that are chemically valid and geometrically aligned with a reference shape.


📦 Model Summary

  • Architecture: Equivariant Diffusion Model (EDM) + Graph Convolutional Network (GCN)
  • Training Data: 1.6 million ChEMBL compounds, filtered for molecules with 15–39 heavy atoms
  • Post-Processing: Deterministic standardization pipeline using RDKit with constrained MMFF94 geometry optimization
  • Primary Metric: Shape Tanimoto Similarity
  • Developed by: Denis Sapegin

🚀 Intended Use

  • Non-Commercial Research in 3D molecular generation
  • Academic/educational use
  • Generation of molecules similar to a reference conformer
  • Generation of molecules similar to an arbitrary reference shape

🚫 Out of Scope / Limitations

  • Commercial Use: Not licensed for commercial use without explicit permission.
  • Training Bias: Trained on ChEMBL data, so results may be biased toward drug-like molecules and chemistries.
  • Elements Supported: Only the following elements are supported for generation: H, C, N, O, F, P, S, Cl, Br.
  • Molecular Size Limitations:
    • Trained on molecules containing 15–39 heavy atoms.
    • By architectural design, the model can only generate molecules with up to 42 heavy atoms.
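
The element and size limits above can be checked before a molecule is used with the model. The following is a minimal sketch and is not part of `mlconfgen`: the helper name `check_reference_atoms` and its return format are hypothetical, and it works on a plain list of element symbols so that it does not depend on any toolkit.

```python
# Limits quoted from this model card.
SUPPORTED_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br"}
TRAINED_HEAVY_ATOMS = range(15, 40)  # 15-39 heavy atoms, inclusive
MAX_HEAVY_ATOMS = 42                 # architectural upper bound


def check_reference_atoms(symbols):
    """Return a list of human-readable problems (empty if none).

    `symbols` is a list of element symbols, e.g. ["C", "C", "O", "H"].
    This helper is illustrative and is not part of mlconfgen.
    """
    problems = []
    unsupported = sorted(set(symbols) - SUPPORTED_ELEMENTS)
    if unsupported:
        problems.append(f"unsupported elements: {', '.join(unsupported)}")

    # Hydrogens do not count toward the heavy-atom limits.
    heavy = sum(1 for s in symbols if s != "H")
    if heavy > MAX_HEAVY_ATOMS:
        problems.append(f"{heavy} heavy atoms exceeds the limit of {MAX_HEAVY_ATOMS}")
    elif heavy not in TRAINED_HEAVY_ATOMS:
        problems.append(f"{heavy} heavy atoms is outside the 15-39 training range")
    return problems
```

With RDKit, the symbol list can be obtained from a molecule as `[atom.GetSymbol() for atom in mol.GetAtoms()]`.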

🧪 Evaluation Metrics (100,000 requested samples, 100 denoising steps)

  • Valid molecules (post-standardization, % of requested): 48%
  • 🧬 Chemical novelty: 99.84%
  • 📐 Avg Shape Tanimoto: 53.32%
  • 🎯 Max Shape Tanimoto: 99.69%
  • 🔁 Unique molecules: 99.94%
  • Generation speed: 4.18 valid molecules/sec (NVIDIA H100)
  • 💾 Memory (per thread): up to 4.0 GB
  • 🧬 Fréchet Fingerprint Distance (to ChEMBL): 4.13
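
To make the ratio-based metrics concrete, here is a minimal sketch of how validity, uniqueness, and novelty are commonly computed from lists of canonical identifiers such as SMILES. The card states only that validity is measured against the number of requested samples; the denominators used for uniqueness (valid molecules) and novelty (unique molecules) follow a common convention and are an assumption here, as is the function name `generation_metrics`.

```python
def generation_metrics(generated, n_requested, training_set):
    """Compute validity, uniqueness and novelty as fractions.

    generated    : canonical identifiers (e.g. SMILES) of the valid molecules
    n_requested  : number of samples requested from the model
    training_set : canonical identifiers of the training molecules
    """
    valid = list(generated)
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        # Share of requested samples that survived standardization.
        "validity": len(valid) / n_requested if n_requested else 0.0,
        # Share of valid molecules that are distinct (assumed denominator).
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        # Share of distinct molecules absent from training (assumed denominator).
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

For example, four valid molecules out of eight requested, with one duplicate and one training-set member, give a validity of 0.5, a uniqueness of 0.75, and a novelty of 2/3.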

🧠 How It Works

Core Components:

  • EDM generates atom coordinates and types under shape constraints
  • GCN predicts adjacency matrices (bonding)
  • RDKit pipeline enforces valence, performs sanitization, and optimizes geometry
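
The hand-off between the last two components can be pictured as a bond matrix over the generated atoms that is then checked for chemical plausibility. The sketch below is a deliberately simplified, toolkit-free stand-in for a valence check. The function name, the bond-order matrix format, and the valence table are assumptions made for illustration; the actual pipeline uses RDKit, whose valence model is richer.

```python
# Simplified maximum valences for the supported elements (neutral atoms only).
# These values are illustrative and do not reproduce RDKit's valence rules.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1,
               "P": 5, "S": 6, "Cl": 1, "Br": 1}


def valence_violations(symbols, bond_orders):
    """Return indices of atoms whose summed bond order exceeds MAX_VALENCE.

    symbols     : list of element symbols, one per atom
    bond_orders : symmetric N x N matrix (nested lists); entry [i][j] is the
                  bond order between atoms i and j, or 0 if they are not bonded
    """
    violations = []
    for i, symbol in enumerate(symbols):
        valence = sum(bond_orders[i])
        if valence > MAX_VALENCE[symbol]:
            violations.append(i)
    return violations
```

A molecule with an empty result passes this simplified check; any returned index marks an atom that carries more bonds than the table allows.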

Shape Alignment:

Evaluated using Gaussian molecular volume overlap and Shape Tanimoto Similarity.

Hydrogens are excluded from similarity computation.
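
Shape Tanimoto is the ratio V_AB / (V_A + V_B - V_AB), where V_AB is the overlap volume of the two molecules and V_A, V_B are their self-overlaps. The sketch below illustrates the idea with a first-order Gaussian overlap in pure Python. It is not the package's implementation: it assumes the two coordinate sets are already aligned, gives every atom the same Gaussian, and uses arbitrary illustrative values for `alpha` and `p`.

```python
import math


def gaussian_overlap(coords_a, coords_b, alpha=0.8, p=2.8):
    """First-order overlap volume of two sets of identical spherical Gaussians.

    Each atom is modelled as p * exp(-alpha * r**2); the integral of the
    product of two such Gaussians at squared distance d2 has a closed form.
    """
    prefactor = p * p * (math.pi / (2.0 * alpha)) ** 1.5
    total = 0.0
    for ax, ay, az in coords_a:
        for bx, by, bz in coords_b:
            d2 = (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
            total += prefactor * math.exp(-0.5 * alpha * d2)
    return total


def shape_tanimoto(coords_a, coords_b):
    """Shape Tanimoto of two pre-aligned sets of heavy-atom coordinates."""
    v_ab = gaussian_overlap(coords_a, coords_b)
    v_aa = gaussian_overlap(coords_a, coords_a)
    v_bb = gaussian_overlap(coords_b, coords_b)
    return v_ab / (v_aa + v_bb - v_ab)
```

Passing only heavy-atom coordinates mirrors the exclusion of hydrogens described above. Identical shapes score 1, and the score falls toward 0 as the shapes stop overlapping.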


💾 Access & Licensing

The Python package and inference code are available on GitHub under the Apache 2.0 License:

https://github.com/Membrizard/ml_conformer_generator

The trained model weights are available at

https://huggingface.co/Membrizard/ml_conformer_generator

and are licensed under CC BY-NC-ND 4.0.

Use of the trained weights for any profit-generating activity is restricted.

For commercial licensing and inference-as-a-service, contact: Denis Sapegin


Installation

  1. Install the package:

pip install mlconfgen

  2. Download the weights from Hugging Face:

    https://huggingface.co/Membrizard/ml_conformer_generator

PyTorch

edm_moi_chembl_15_39.pt

adj_mat_seer_chembl_15_39.pt

ONNX

edm_moi_chembl_15_39.onnx

adj_mat_seer_chembl_15_39.onnx
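
The files can also be fetched programmatically. The following is a sketch that assumes the separate `huggingface_hub` package is installed and that you have accepted the access terms and authenticated (for example with `huggingface-cli login`), since the repository is gated. The helper name `download_weights` is hypothetical, and the default file names are the PyTorch weights listed above; check the repository file listing if a download fails.

```python
REPO_ID = "Membrizard/ml_conformer_generator"
PYTORCH_WEIGHTS = ("edm_moi_chembl_15_39.pt", "adj_mat_seer_chembl_15_39.pt")


def download_weights(filenames=PYTORCH_WEIGHTS, local_dir="."):
    """Download the given weight files and return their local paths."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in filenames
    ]
```

Calling `download_weights()` places both PyTorch files in the current directory, which matches the relative paths used in the Python API examples.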


🐍 Python API

PyTorch

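from rdkit import Chem
from mlconfgen import MLConformerGenerator, evaluate_samples

# Load both sets of PyTorch weights downloaded during installation
model = MLConformerGenerator(
    edm_weights="./edm_moi_chembl_15_39.pt",
    adj_mat_seer_weights="./adj_mat_seer_chembl_15_39.pt",
    diffusion_steps=100,
)

# Reference conformer that defines the target shape
reference = Chem.MolFromMolFile("ceyyag.mol")

# Generate 20 samples conditioned on the reference shape
samples = model.generate_conformers(reference_conformer=reference, n_samples=20, variance=2)

# Evaluate the samples against the reference
aligned_reference, std_samples = evaluate_samples(reference, samples)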

ONNX

from mlconfgen import MLConformerGeneratorONNX
from rdkit import Chem

# Point these paths at the ONNX files you downloaded
model = MLConformerGeneratorONNX(
    egnn_onnx="./egnn_chembl_15_39.onnx",
    adj_mat_seer_onnx="./adj_mat_seer_chembl_15_39.onnx",
    diffusion_steps=100,
)

# Reference conformer that defines the target shape
reference = Chem.MolFromMolFile("ceyyag.mol")

# Generate 20 samples conditioned on the reference shape
samples = model.generate_conformers(reference_conformer=reference, n_samples=20, variance=2)