---
library_name: transformers
tags:
- gene-ontology
- proteomics
datasets:
- andrewdalpino/CAFA5
metrics:
- precision
- recall
- f1
base_model:
- facebook/esm2_t30_150M_UR50D
pipeline_tag: text-classification
---
# ESM2 Protein Function Caller
An Evolutionary-scale Model (ESM) for protein function calling from amino acid sequences. Based on the ESM2 Transformer architecture and fine-tuned on the CAFA 5 dataset, this model predicts the Gene Ontology (GO) subgraph for a particular protein sequence, giving you insight into the molecular function, biological process, and location of the activity inside the cell.

Note: This model specializes in the cellular component subgraph of the Gene Ontology.
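For a quick look at the model's output you can also drive it through the `transformers` pipeline API. This is a minimal sketch rather than the card's own example; the call-time `function_to_apply="sigmoid"` and `top_k` arguments are assumptions used here to keep the multi-label scores independent.

```python
from transformers import pipeline

# Sketch of a pipeline-based quickstart (assumed workflow, not the card's own example).
classifier = pipeline(
    "text-classification",
    model="andrewdalpino/ESM2-150M-Protein-Cellular-Component",
)

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMMGKKWQMPMCSLH"

# Sigmoid keeps each GO term's score independent, since this is a multi-label task.
print(classifier(sequence, function_to_apply="sigmoid", top_k=10))
```

See the Example Usage section below for the equivalent lower-level workflow.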
## Code Repository
https://github.com/andrewdalpino/esm2-function-classifier
## Model Specs
- Vocabulary Size: 33
- Embedding Dimensions: 640
- Attention Heads: 20
- Hidden Layers: 30
- Context Length: 1026
- Total Parameters: 151M
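These numbers can be sanity-checked against the checkpoint's published configuration. A minimal sketch, assuming the standard `EsmConfig` field names in `transformers`:

```python
from transformers import AutoConfig

# Pull the configuration that ships with the checkpoint.
config = AutoConfig.from_pretrained("andrewdalpino/ESM2-150M-Protein-Cellular-Component")

print(f"Vocabulary size: {config.vocab_size}")          # 33
print(f"Embedding dimensions: {config.hidden_size}")    # 640
print(f"Attention heads: {config.num_attention_heads}")
print(f"Hidden layers: {config.num_hidden_layers}")
print(f"Context length: {config.max_position_embeddings}")
```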
## Example Usage
```python
import torch

from transformers import EsmTokenizer, EsmForSequenceClassification

model_name = "andrewdalpino/ESM2-150M-Protein-Cellular-Component"

tokenizer = EsmTokenizer.from_pretrained(model_name)
model = EsmForSequenceClassification.from_pretrained(model_name)

model.eval()

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMMGKKWQMPMCSLH"

top_k = 10

# Tokenize the amino acid sequence, truncating to the model's 1026-token context.
out = tokenizer(
    sequence,
    max_length=1026,
    truncation=True,
)

input_ids = torch.tensor(out["input_ids"], dtype=torch.int64).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_ids)

# Sigmoid converts each logit into an independent per-term probability (multi-label).
probabilities = torch.sigmoid(outputs.logits.squeeze(0))

probabilities, indices = torch.topk(probabilities, top_k)

# Map the top-k label indices back to their GO term identifiers.
terms = [model.config.id2label[index] for index in indices.tolist()]

print(f"Top {top_k} GO Terms:")

for term, probability in zip(terms, probabilities.tolist()):
    print(f"{probability:.4f}: {term}")
```
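Since this is a multi-label classifier, a probability cutoff is often more natural than a fixed top-k. The snippet below continues from the example above and uses an illustrative threshold of 0.5 (an assumption; tune it for your own precision/recall trade-off):

```python
threshold = 0.5  # illustrative cutoff, not a recommended value

# Keep every GO term whose sigmoid probability clears the threshold.
probabilities = torch.sigmoid(outputs.logits.squeeze(0))

indices = torch.nonzero(probabilities > threshold).flatten().tolist()

predictions = {model.config.id2label[index]: probabilities[index].item() for index in indices}

for term, probability in sorted(predictions.items(), key=lambda item: item[1], reverse=True):
    print(f"{probability:.4f}: {term}")
```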
## Training Results
- Epochs: 20
- Test F1: 0.63
- Test Precision: 0.78
- Test Recall: 0.53
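The training and evaluation scripts live in the code repository linked above. As a rough illustration of how multi-label precision, recall, and F1 like those reported here can be computed from sigmoid outputs, here is a sketch; the 0.5 threshold and micro-averaging are assumptions, not necessarily the exact evaluation protocol used:

```python
import torch


def micro_multilabel_metrics(probabilities: torch.Tensor, targets: torch.Tensor, threshold: float = 0.5):
    """Micro-averaged precision, recall, and F1 over a (num_samples, num_labels) batch.

    `probabilities` holds sigmoid outputs; `targets` holds 0/1 ground-truth annotations.
    """
    predictions = (probabilities > threshold).float()

    true_positives = (predictions * targets).sum()

    precision = true_positives / predictions.sum().clamp(min=1)
    recall = true_positives / targets.sum().clamp(min=1)
    f1 = 2 * precision * recall / (precision + recall).clamp(min=1e-8)

    return precision.item(), recall.item(), f1.item()
```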
## References
- A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021.
- Z. Lin, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model, 2022.
- G. A. Merino, et al. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge, 2022.
- I. Friedberg, et al. CAFA 5 Protein Function Prediction. https://kaggle.com/competitions/cafa-5-protein-function-prediction, 2023.
- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.