metadata
library_name: transformers
tags:
  - gene-ontology
  - proteomics
datasets:
  - andrewdalpino/CAFA5
metrics:
  - precision
  - recall
  - f1
base_model:
  - facebook/esm2_t30_150M_UR50D
pipeline_tag: text-classification

ESM2 Protein Function Caller

An Evolutionary-scale Model (ESM) for protein function calling from amino acid sequences. Based on the ESM2 Transformer architecture and fine-tuned on the CAFA 5 dataset, this model predicts the gene ontology (GO) subgraph for a given protein sequence, giving you insight into the molecular function, biological process, and location of the activity inside the cell.

Note: This model specializes in the cellular component subgraph of the gene ontology.

Code Repository

https://github.com/andrewdalpino/esm2-function-classifier

Model Specs

  • Vocabulary Size: 33
  • Embedding Dimensions: 640
  • Attention Heads: 20
  • Hidden Layers: 30
  • Context Length: 1026
  • Total Parameters: 151M
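
As a rough sanity check on the specs above, the encoder's parameter count can be estimated from the vocabulary size, embedding dimensions, and layer count. This is only a sketch: it assumes a standard Transformer encoder layout with a feed-forward width of 4x the embedding dimensions, biases on every linear layer, and no learned position embeddings. None of those details are stated in this card, and the function name `estimate_encoder_parameters` is hypothetical.

```python
def estimate_encoder_parameters(vocab_size, embed_dim, num_layers, ffn_multiplier=4):
    """Estimate the parameter count of a standard Transformer encoder.

    Assumptions (not taken from the model card): feed-forward width is
    ffn_multiplier * embed_dim, all linear layers have biases, and there
    are no learned position embeddings.
    """
    ffn_dim = ffn_multiplier * embed_dim

    token_embeddings = vocab_size * embed_dim

    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (embed_dim * embed_dim + embed_dim)

    # Two linear layers: embed_dim -> ffn_dim -> embed_dim, each with a bias.
    feed_forward = (embed_dim * ffn_dim + ffn_dim) + (ffn_dim * embed_dim + embed_dim)

    # Two layer norms per layer, each with a weight and a bias vector.
    layer_norms = 2 * (2 * embed_dim)

    per_layer = attention + feed_forward + layer_norms

    final_layer_norm = 2 * embed_dim

    return token_embeddings + num_layers * per_layer + final_layer_norm


total = estimate_encoder_parameters(vocab_size=33, embed_dim=640, num_layers=30)

print(f"Estimated encoder parameters: {total / 1e6:.1f}M")
```

The estimate lands a little under the 151M listed above. The sketch leaves out the classification head, whose size depends on the number of GO terms, which this card does not state.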

Example Usage

import torch

from transformers import EsmTokenizer, EsmForSequenceClassification

model_name = "andrewdalpino/ESM2-150M-Protein-Cellular-Component"

tokenizer = EsmTokenizer.from_pretrained(model_name)

model = EsmForSequenceClassification.from_pretrained(model_name)

# Switch to inference mode (disables dropout).
model.eval()

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMMGKKWQMPMCSLH"

top_k = 10

# Tokenize, truncating to the model's context length.
out = tokenizer(
    sequence,
    max_length=1026,
    truncation=True,
)

input_ids = out["input_ids"]

# Add a batch dimension of size 1.
input_ids = torch.tensor(input_ids, dtype=torch.int64).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_ids)

    # Sigmoid scores each GO term independently of the others.
    probabilities = torch.sigmoid(outputs.logits.squeeze(0))

    probabilities, indices = torch.topk(probabilities, top_k)

probabilities = probabilities.tolist()

# Map class indices back to GO term identifiers.
terms = [model.config.id2label[index] for index in indices.tolist()]

print(f"Top {top_k} GO Terms:")

for term, probability in zip(terms, probabilities):
    print(f"{probability:.4f}: {term}")
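
Because each GO term is scored independently with a sigmoid, an alternative to taking the top k terms is to keep every term whose probability clears a cutoff. The sketch below shows that idea in plain Python so it runs without downloading the model. The helper name `select_terms`, the 0.5 threshold, and the toy logits and labels are all assumptions for illustration, not values from this model.

```python
import math


def select_terms(logits, id2label, threshold=0.5):
    """Return (term, probability) pairs whose sigmoid probability meets the threshold."""
    selected = []

    for index, logit in enumerate(logits):
        probability = 1.0 / (1.0 + math.exp(-logit))

        if probability >= threshold:
            selected.append((id2label[index], probability))

    # Highest-probability terms first.
    selected.sort(key=lambda pair: pair[1], reverse=True)

    return selected


# Toy values for illustration only, not real model output.
example_logits = [2.0, -1.0, 0.0, -3.0]

example_labels = {0: "GO:0005634", 1: "GO:0005737", 2: "GO:0005886", 3: "GO:0005576"}

for term, probability in select_terms(example_logits, example_labels):
    print(f"{probability:.4f}: {term}")
```

With the real model, the inputs would be `outputs.logits.squeeze(0).tolist()` and `model.config.id2label`. A lower threshold trades precision for recall, so the right cutoff depends on the application.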

Training Results

  • Epochs: 20
  • Test F1: 0.63
  • Test Precision: 0.78
  • Test Recall: 0.53
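
The three test metrics above can be cross-checked against each other, since F1 is the harmonic mean of precision and recall. This is a minimal sketch of that arithmetic only. The card does not say how the metrics were averaged across GO terms, so the check applies to the summary figures and nothing more.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0

    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.78, recall=0.53)

print(f"F1: {f1:.2f}")
```

The result rounds to the reported 0.63. The gap between precision and recall suggests the model is conservative: the terms it predicts are usually right, but it misses a fair share of true annotations.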

References

  • A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021.
  • Z. Lin, et al. Evolutionary-scale prediction of atomic level protein structure with a language model, 2022.
  • G. A. Merino, et al. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge, 2022.
  • I. Friedberg, et al. CAFA 5 Protein Function Prediction. https://kaggle.com/competitions/cafa-5-protein-function-prediction, 2023.
  • M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.