InstaNovo-P: De novo Peptide Sequencing Model for Phosphoproteomics

Model Description

InstaNovo-P is a transformer-based model for de novo peptide sequencing from phosphoproteomics mass spectrometry data. It is trained and optimized to identify phosphorylated peptides and their modification sites. The model predicts peptide sequences directly from MS/MS spectra and detects and localizes phosphorylation sites, which makes it particularly useful for phosphoproteomics studies and post-translational modification (PTM) discovery.

Usage

import torch
import numpy as np
import pandas as pd
from instanovo.transformer.model import InstaNovo
from instanovo.utils import SpectrumDataFrame
from instanovo.transformer.dataset import SpectrumDataset, collate_batch
from torch.utils.data import DataLoader
from instanovo.inference import ScoredSequence
from instanovo.inference import BeamSearchDecoder
from instanovo.utils.metrics import Metrics
from tqdm.notebook import tqdm

# Load the model from the Hugging Face Hub
model, config = InstaNovo.from_pretrained("InstaDeepAI/instanovo-phospho-v1.0.0")

# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Update the residue set with custom modifications
model.residue_set.update_remapping(
    {
        "M(ox)": "M[UNIMOD:35]",
        "M(+15.99)": "M[UNIMOD:35]",
        "S(p)": "S[UNIMOD:21]",  # Phosphorylation
        "T(p)": "T[UNIMOD:21]",
        "Y(p)": "Y[UNIMOD:21]",
        "S(+79.97)": "S[UNIMOD:21]",
        "T(+79.97)": "T[UNIMOD:21]",
        "Y(+79.97)": "Y[UNIMOD:21]",
        "Q(+0.98)": "Q[UNIMOD:7]",  # Deamidation
        "N(+0.98)": "N[UNIMOD:7]",
        "Q(+.98)": "Q[UNIMOD:7]",
        "N(+.98)": "N[UNIMOD:7]",
        "C(+57.02)": "C[UNIMOD:4]",  # Carboxyamidomethylation
        "(+42.01)": "[UNIMOD:1]",  # Acetylation
        "(+43.01)": "[UNIMOD:5]",  # Carbamylation
        "(-17.03)": "[UNIMOD:385]",
    }
)

# Load the test data
sdf = SpectrumDataFrame.from_huggingface(
    "InstaDeepAI/InstaNovo-P",
    is_annotated=True,
    shuffle=False,
    split="test[:10%]",  # Let's only use a subset of the test data for faster inference
)

# Create the dataset
ds = SpectrumDataset(
    sdf,
    model.residue_set,
    config.get("n_peaks", 200),
    return_str=True,
    annotated=True,
)

# Create the data loader
dl = DataLoader(ds, batch_size=64, shuffle=False, num_workers=0, collate_fn=collate_batch)

# Create the decoder
decoder = BeamSearchDecoder(model=model)

# Initialize lists to store predictions and targets
preds = []
targs = []
probs = []

# Iterate over the data loader
for _, batch in tqdm(enumerate(dl), total=len(dl)):
    spectra, precursors, _, peptides, _ = batch
    spectra = spectra.to(device)
    precursors = precursors.to(device)

    # Perform inference
    with torch.no_grad():
        p = decoder.decode(
            spectra=spectra,
            precursors=precursors,
            beam_size=config["n_beams"],
            max_length=config["max_length"],
        )

    preds += [x.sequence if isinstance(x, ScoredSequence) else [] for x in p]
    probs += [
        x.sequence_log_probability if isinstance(x, ScoredSequence) else -float("inf") for x in p
    ]
    targs += list(peptides)

# Initialize metrics
metrics = Metrics(model.residue_set, config["isotope_error_range"])

# Compute precision and recall
aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(
    targs, preds
)

# Compute amino acid error rate and AUC
aa_error_rate = metrics.compute_aa_er(targs, preds)
auc = metrics.calc_auc(targs, preds, np.exp(pd.Series(probs)))

print(f"amino acid error rate:    {aa_error_rate:.5f}")
print(f"amino acid precision:     {aa_precision:.5f}")
print(f"amino acid recall:        {aa_recall:.5f}")
print(f"peptide precision:        {peptide_precision:.5f}")
print(f"peptide recall:           {peptide_recall:.5f}")
print(f"area under the PR curve:  {auc:.5f}")
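
The `update_remapping` call in the example above maps legacy modification notations (such as `S(p)` or `S(+79.97)`) to UNIMOD identifiers. The standalone sketch below illustrates that idea on plain strings. It is not InstaNovo's implementation: `remap_legacy_notation` is a hypothetical helper, and the table is only a subset of the one above.

```python
import re

# A subset of the remapping table from the usage example above.
LEGACY_TO_UNIMOD = {
    "M(ox)": "M[UNIMOD:35]",
    "S(p)": "S[UNIMOD:21]",
    "S(+79.97)": "S[UNIMOD:21]",
    "N(+.98)": "N[UNIMOD:7]",
    "C(+57.02)": "C[UNIMOD:4]",
    "(+42.01)": "[UNIMOD:1]",
}

# Try longer keys first so a residue-specific tag wins over a shorter overlapping key.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(LEGACY_TO_UNIMOD, key=len, reverse=True))
)


def remap_legacy_notation(peptide):
    """Hypothetical helper: rewrite legacy modification tags as UNIMOD tags."""
    return _PATTERN.sub(lambda match: LEGACY_TO_UNIMOD[match.group(0)], peptide)


print(remap_legacy_notation("(+42.01)AM(ox)S(p)C(+57.02)K"))
# prints "[UNIMOD:1]AM[UNIMOD:35]S[UNIMOD:21]C[UNIMOD:4]K"
```

Tags that are not in the table are left untouched, so unknown notations remain visible rather than being silently dropped.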

For more explanation, see the Getting Started notebook in the repository.
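
Once `preds` and `probs` are collected, a common next step is to keep only confident predictions and list their phosphorylation sites. The sketch below shows one possible way to do this. It assumes each prediction is a list of residue tokens in UNIMOD notation (for example `["A", "S[UNIMOD:21]", "K"]`) and that a terminal modification, if present, is a separate token starting with `[`. The helper names and the 0.9 threshold are illustrative and not part of the InstaNovo API.

```python
import math
import re

# Assumption: phosphorylated residues appear as tokens such as "S[UNIMOD:21]".
PHOSPHO_TOKEN = re.compile(r"^([STY])\[UNIMOD:21\]$")


def phosphosites(tokens):
    """Return (1-based residue position, residue) for each phosphorylated residue."""
    sites = []
    position = 0
    for token in tokens:
        if token.startswith("["):  # assumed terminal modification token, not a residue
            continue
        position += 1
        match = PHOSPHO_TOKEN.match(token)
        if match:
            sites.append((position, match.group(1)))
    return sites


def confident_predictions(preds, log_probs, min_confidence=0.9):
    """Keep predictions whose sequence probability exp(log_prob) meets the threshold."""
    kept = []
    for tokens, log_prob in zip(preds, log_probs):
        confidence = math.exp(log_prob)  # exp(-inf) is 0.0, so failed decodes drop out
        if tokens and confidence >= min_confidence:
            kept.append((tokens, confidence, phosphosites(tokens)))
    return kept


# Toy data standing in for the `preds` and `probs` lists built in the loop above.
example_preds = [["A", "S[UNIMOD:21]", "P", "K"], ["G", "T", "Y[UNIMOD:21]", "R"], []]
example_log_probs = [-0.02, -1.5, -float("inf")]

for tokens, confidence, sites in confident_predictions(example_preds, example_log_probs):
    print("".join(tokens), round(confidence, 3), sites)
```

The threshold trades recall for precision, so it is worth choosing it from the precision-recall behaviour on annotated data rather than fixing it in advance.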

Citation

If you use InstaNovo-P in your research, please cite:

@article {Lauridsen2025.05.14.654049,
    title = {InstaNovo-P: A de novo peptide sequencing model for phosphoproteomics},
    author = {Lauridsen, Jesper and Ramasamy, Pathmanaban and Catzel, Rachel and Canbay, Vahap 
        and Mabona, Amandla and Eloff, Kevin and Fullwood, Paul and Ferguson, Jennifer and 
        Kirketerp-M{\o}ller, Annekatrine and Goldschmidt, Ida Sofie and Claeys, Tine and van 
        Puyenbroeck, Sam and Lopez Carranza, Nicolas and Schoof, Erwin M. and Martens, Lennart and 
        Van Goey, Jeroen and Francavilla, Chiara and Jenkins, Timothy Patrick and Kalogeropoulos, 
        Konstantinos},
    elocation-id = {2025.05.14.654049},
    year = {2025},
    doi = {10.1101/2025.05.14.654049},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/05/18/2025.05.14.654049},
    eprint = {https://www.biorxiv.org/content/early/2025/05/18/2025.05.14.654049.full.pdf},
    journal = {bioRxiv}
}

For the general InstaNovo model, please cite:

@article{eloff_kalogeropoulos_2025_instanovo,
        title        = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
                        proteomics experiments},
        author       = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
                        Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
                        Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
                        and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
                        Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
                        Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
        year         = {2025},
        month        = {Mar},
        day          = {31},
        journal      = {Nature Machine Intelligence},
        doi          = {10.1038/s42256-025-01019-5},
        issn         = {2522-5839},
        url          = {https://doi.org/10.1038/s42256-025-01019-5}
}

Resources

License

  • Code: Licensed under Apache License 2.0
  • Model Checkpoints: Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)

Installation

pip install instanovo

For GPU support, install with CUDA dependencies:

pip install "instanovo[cu126]"

Requirements

  • Python >= 3.10, < 3.13
  • PyTorch >= 1.13.0
  • CUDA (optional, for GPU acceleration)
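
A quick way to check an environment against the requirements listed above is sketched below. It only uses the standard library plus PyTorch if it is installed. The function names are illustrative, and the script does not verify the minimum PyTorch version.

```python
import importlib.util
import sys


def python_version_supported(version=None):
    """True if the interpreter is in the stated range: >= 3.10 and < 3.13."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 13)


def describe_environment():
    """Summarize Python support and, if PyTorch is installed, CUDA availability."""
    report = {"python_ok": python_version_supported(), "torch_installed": False, "cuda": False}
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch_installed"] = True
        report["torch_version"] = torch.__version__
        report["cuda"] = torch.cuda.is_available()
    return report


print(describe_environment())
```

If `cuda` is reported as `False`, the usage example still runs, since it falls back to the CPU.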

Support

For questions, issues, or contributions, please visit the GitHub repository or check the documentation.
