---
library_name: transformers
tags:
- gene-ontology
- proteomics
datasets:
- andrewdalpino/AmiGO
metrics:
- precision
- recall
- f1
base_model:
- facebook/esm2_t30_150M_UR50D
pipeline_tag: text-classification
---

# ESM2 Protein Function Caller

An Evolutionary-scale Model (ESM) for protein function prediction from amino acid sequences using the Gene Ontology (GO). Based on the ESM2 Transformer architecture, pre-trained on [UniRef50](https://www.uniprot.org/help/uniref), and fine-tuned on the [AmiGO](https://huggingface.co/datasets/andrewdalpino/AmiGO) dataset, this model predicts the GO subgraph for a particular protein sequence, giving you insight into the molecular function, biological process, and cellular location of the protein's activity.

**Note**: This version only models the `cellular component` subgraph of the gene ontology.

## What are GO terms?

> "The Gene Ontology (GO) is a concept hierarchy that describes the biological function of genes and gene products at different levels of abstraction (Ashburner et al., 2000). It is a good model to describe the multi-faceted nature of protein function."

> "GO is a directed acyclic graph. The nodes in this graph are functional descriptors (terms or classes) connected by relational ties between them (is_a, part_of, etc.). For example, terms 'protein binding activity' and 'binding activity' are related by an is_a relationship; however, the edge in the graph is often reversed to point from binding towards protein binding. This graph contains three subgraphs (subontologies): Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), defined by their root nodes. Biologically, each subgraph represents a different aspect of the protein's function: what it does on a molecular level (MF), which biological processes it participates in (BP), and where in the cell it is located (CC)."

From [CAFA 5 Protein Function Prediction](https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/data)
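The DAG structure described above can be sketched in a few lines of plain Python. The term names and parent relationships below are illustrative stand-ins, not real GO identifiers; the key idea is that annotating a protein with a term implicitly annotates it with every ancestor reachable via `is_a` edges.

```python
# Minimal sketch of GO's DAG structure with hypothetical terms.
# Each term maps to its direct "is_a" parents.
GO_PARENTS = {
    "protein binding": ["binding"],
    "binding": ["molecular_function"],
    "molecular_function": [],  # root of the MF subontology
}


def ancestors(term, parents=GO_PARENTS):
    """Return the set of all ancestors of a term via is_a edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(ancestors("protein binding"))  # {'binding', 'molecular_function'}
```

In the real ontology a term may have multiple parents and mixed edge types (`is_a`, `part_of`), which is why GO is a DAG rather than a tree.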

## Code Repository

https://github.com/andrewdalpino/esm2-function-classifier

## Model Specs

- **Vocabulary Size**: 33
- **Embedding Dimensions**: 640
- **Attention Heads**: 20
- **Encoder Layers**: 30
- **Context Length**: 1026
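A quick sanity check on the specs above: in a standard multi-head attention layer, the embedding dimension must divide evenly by the number of heads, with each head attending over `embedding_dim / num_heads` dimensions.

```python
# Consistency check of the published specs (values from the list above).
embedding_dim = 640
num_heads = 20

# Each attention head operates on an equal slice of the embedding.
assert embedding_dim % num_heads == 0
head_dim = embedding_dim // num_heads

print(head_dim)  # 32
```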

## Basic Example

For a basic demonstration, we can rank the GO terms for a particular sequence. For a more advanced example, see the [predict-subgraph.py](https://github.com/andrewdalpino/esm2-function-classifier/blob/master/predict-subgraph.py) source file.

```python
import torch

from transformers import EsmTokenizer, EsmForSequenceClassification

model_name = "andrewdalpino/ESM2-35M-Protein-Molecular-Function"

tokenizer = EsmTokenizer.from_pretrained(model_name)

model = EsmForSequenceClassification.from_pretrained(model_name)

model.eval()

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMM"

top_k = 10

# Tokenize the sequence and return a batched PyTorch tensor.
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

    # The model is multi-label, so apply a sigmoid rather than a softmax.
    probabilities = torch.sigmoid(outputs.logits.squeeze(0))

    probabilities, indices = torch.topk(probabilities, top_k)

probabilities = probabilities.tolist()

terms = [model.config.id2label[index] for index in indices.tolist()]

print(f"Top {top_k} GO Terms:")

for term, probability in zip(terms, probabilities):
    print(f"{probability:.4f}: {term}")
```
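Because the model scores each term independently, a child term can receive a higher probability than its parent, which is impossible under the ontology's semantics. A common post-processing step (an assumption here, not something the model does for you) restores consistency by propagating each term's probability up to its ancestors:

```python
def propagate_to_ancestors(probs, parents):
    """Make scores hierarchy-consistent: each term's probability becomes
    the max over itself and all of its descendants.

    probs:   dict mapping term -> predicted probability
    parents: dict mapping term -> list of direct is_a parents
    """
    consistent = dict(probs)

    # Push child scores to parents until a fixed point is reached.
    changed = True

    while changed:
        changed = False

        for term, term_parents in parents.items():
            for parent in term_parents:
                if consistent[term] > consistent.get(parent, 0.0):
                    consistent[parent] = consistent[term]
                    changed = True

    return consistent


# Toy example with hypothetical terms and scores.
parents = {"protein binding": ["binding"], "binding": []}
probs = {"protein binding": 0.9, "binding": 0.4}

print(propagate_to_ancestors(probs, parents))
# {'protein binding': 0.9, 'binding': 0.9}
```

After propagation, every predicted term's ancestors score at least as high as the term itself, so thresholding the scores yields a valid subgraph of the ontology.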

## References

>- A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021.
>- Z. Lin, et al. Evolutionary-scale prediction of atomic level protein structure with a language model, 2022.
>- G. A. Merino, et al. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge, 2022.
>- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.