---
library_name: transformers
tags:
- gene-ontology
- proteomics
datasets:
- andrewdalpino/CAFA5
metrics:
- precision
- recall
- f1
base_model:
- facebook/esm2_t30_150M_UR50D
pipeline_tag: text-classification
---

# ESM2 Protein Function Caller

An Evolutionary-scale Model (ESM) for protein function calling from amino acid sequences. Based on the ESM2 Transformer architecture and fine-tuned on the [CAFA 5](https://huggingface.co/datasets/andrewdalpino/CAFA5) dataset, this model predicts the gene ontology (GO) subgraph for a particular protein sequence, giving you insight into the molecular function, biological process, and cellular location of its activity.

**Note**: This model specializes in the `cellular component` subgraph of the gene ontology.

## Code Repository

https://github.com/andrewdalpino/esm2-function-classifier

## Model Specs

- **Vocabulary Size**: 33
- **Embedding Dimensions**: 640
- **Attention Heads**: 20
- **Hidden Layers**: 30
- **Context Length**: 1026
- **Total Parameters**: 151M
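
If you want to double-check these specs against the published checkpoint, the values are exposed on the Hugging Face config object. A minimal sketch, assuming the standard `EsmConfig` attribute names (the expected values in the comments come from the list above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("andrewdalpino/ESM2-150M-Protein-Cellular-Component")

print(config.vocab_size)               # expected: 33 (vocabulary size)
print(config.hidden_size)              # expected: 640 (embedding dimensions)
print(config.num_attention_heads)      # expected: 20 (attention heads)
print(config.num_hidden_layers)        # expected: 30 (hidden layers)
print(config.max_position_embeddings)  # expected: 1026 (context length)
```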

## Example Usage

```python
import torch

from transformers import EsmTokenizer, EsmForSequenceClassification

model_name = "andrewdalpino/ESM2-150M-Protein-Cellular-Component"

tokenizer = EsmTokenizer.from_pretrained(model_name)

model = EsmForSequenceClassification.from_pretrained(model_name)

model.eval()

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMMGKKWQMPMCSLH"

top_k = 10

out = tokenizer(
    sequence,
    max_length=1026,
    truncation=True,
)

input_ids = out["input_ids"]

input_ids = torch.tensor(input_ids, dtype=torch.int64).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_ids)

    # Multi-label prediction: each GO term gets an independent probability.
    probabilities = torch.sigmoid(outputs.logits.squeeze(0))

    probabilities, indices = torch.topk(probabilities, top_k)

probabilities = probabilities.tolist()

terms = [model.config.id2label[index] for index in indices.tolist()]

print(f"Top {args.top_k} GO Terms:")

for term, probability in zip(terms, probabilities):
    print(f"{probability:.4f}: {term}")
```
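
Because each GO term gets an independent sigmoid probability, you can also keep every term above a probability cutoff instead of a fixed top-k. A minimal sketch continuing from the snippet above (the 0.5 threshold is an illustrative value, not a tuned operating point):

```python
# Continues from the example above; `outputs` holds the model's logits.
threshold = 0.5  # illustrative cutoff, not a tuned operating point

probabilities = torch.sigmoid(outputs.logits.squeeze(0))

predictions = [
    (model.config.id2label[index], probability.item())
    for index, probability in enumerate(probabilities)
    if probability.item() > threshold
]

for term, probability in sorted(predictions, key=lambda pair: pair[1], reverse=True):
    print(f"{probability:.4f}: {term}")
```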

## Training Results

- **Epochs**: 20
- **Test F1**: 0.63
- **Test Precision**: 0.78
- **Test Recall**: 0.53
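
As a point of reference for how such numbers are typically computed for multi-label GO prediction, here is a minimal sketch of micro-averaged precision, recall, and F1. The `y_true` and `y_pred` arrays are hypothetical 0/1 matrices of shape `(num_sequences, num_terms)`, and micro-averaging is an assumption, not necessarily the scheme behind the figures above:

```python
import numpy as np


def micro_precision_recall_f1(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all (sequence, term) pairs."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    return precision, recall, f1
```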

## References:

>- A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021.
>- Z. Lin, et al. Evolutionary-scale prediction of atomic level protein structure with a language model, 2022.
>- G. A. Merino, et al. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge, 2022.
>- I. Friedberg, et al. CAFA 5 Protein Function Prediction. https://kaggle.com/competitions/cafa-5-protein-function-prediction, 2023.
>- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.