---
library_name: transformers
tags:
- gene-ontology
- proteomics
datasets:
- andrewdalpino/CAFA5
metrics:
- precision
- recall
- f1
base_model:
- facebook/esm2_t30_150M_UR50D
pipeline_tag: text-classification
---
# ESM2 Protein Function Caller
An Evolutionary-scale Model (ESM) for protein function calling from amino acid sequences. Based on the ESM2 Transformer architecture and fine-tuned on the [CAFA 5](https://huggingface.co/datasets/andrewdalpino/CAFA5) dataset, this model predicts the gene ontology (GO) subgraph for a particular protein sequence, giving you insight into the molecular function, biological process, and location of the activity inside the cell.
**Note**: This model specializes in the `cellular component` subgraph of the gene ontology.
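Since the model emits individual GO terms, reconstructing the full subgraph means walking each predicted term up to its ancestors in the ontology. Below is a minimal sketch using the third-party `obonet` and `networkx` packages, which are an assumption here and not a dependency of this repository; the download URL and example term are illustrative.

```python
# Sketch: expand a predicted GO term into its ancestor subgraph.
# Assumes `obonet` and `networkx` are installed (pip install obonet networkx).

import networkx
import obonet

# Load the GO DAG; obonet encodes edges from subterm to superterm.
graph = obonet.read_obo("http://purl.obolibrary.org/obo/go/go-basic.obo")

term = "GO:0005737"  # Hypothetical predicted term (cytoplasm).

# Superterms (ancestors) are the nodes reachable from the term.
ancestors = networkx.descendants(graph, term)

print(f"{term} has {len(ancestors)} ancestor terms")
```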
## Code Repository
https://github.com/andrewdalpino/esm2-function-classifier
## Model Specs
- **Vocabulary Size**: 33
- **Embedding Dimensions**: 640
- **Attention Heads**: 20
- **Hidden Layers**: 30
- **Context Length**: 1026
- **Total Parameters**: 151M
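As a sanity check, most of these specs can be read straight from the model config. A quick sketch, assuming the standard ESM2 config field names and the repo name used in the example below:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("andrewdalpino/ESM2-150M-Protein-Cellular-Component")

print(config.vocab_size)               # 33
print(config.hidden_size)              # 640
print(config.num_attention_heads)      # 20
print(config.num_hidden_layers)        # 30
print(config.max_position_embeddings)  # 1026
```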
## Example Usage
```python
import torch

from transformers import EsmTokenizer, EsmForSequenceClassification

model_name = "andrewdalpino/ESM2-150M-Protein-Cellular-Component"

tokenizer = EsmTokenizer.from_pretrained(model_name)
model = EsmForSequenceClassification.from_pretrained(model_name)

model.eval()

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMMGKKWQMPMCSLH"

top_k = 10

# Tokenize the sequence, truncating to the model's context length.
out = tokenizer(
    sequence,
    max_length=1026,
    truncation=True,
)

# Convert to a tensor and add a batch dimension.
input_ids = torch.tensor(out["input_ids"], dtype=torch.int64).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_ids)

# Multi-label classification: apply a sigmoid to each logit independently.
probabilities = torch.sigmoid(outputs.logits.squeeze(0))

probabilities, indices = torch.topk(probabilities, top_k)

probabilities = probabilities.tolist()

# Map class indices back to GO term identifiers.
terms = [model.config.id2label[index] for index in indices.tolist()]

print(f"Top {top_k} GO Terms:")

for term, probability in zip(terms, probabilities):
    print(f"{probability:.4f}: {term}")
```
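Instead of a fixed top-k, you may prefer to keep every term whose probability clears a cutoff. A quick variant of the loop above, reusing `outputs` and `model` from the example (the 0.5 threshold is an arbitrary assumption, not a tuned value):

```python
threshold = 0.5  # Arbitrary cutoff; tune on a validation set.

probabilities = torch.sigmoid(outputs.logits.squeeze(0))

# Keep every GO term whose probability clears the threshold.
indices = (probabilities > threshold).nonzero(as_tuple=True)[0]

for index in indices.tolist():
    print(f"{probabilities[index]:.4f}: {model.config.id2label[index]}")
```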
## Training Results
- **Epochs**: 20
- **Test F1**: 0.63
- **Test Precision**: 0.78
- **Test Recall**: 0.53
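These figures are internally consistent: with precision P = 0.78 and recall R = 0.53, the harmonic mean gives F1 = 2PR / (P + R) = (2 × 0.78 × 0.53) / (0.78 + 0.53) ≈ 0.63.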
## References
- A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021.
- Z. Lin, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model, 2022.
- G. A. Merino, et al. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge, 2022.
- I. Friedberg, et al. CAFA 5 Protein Function Prediction. https://kaggle.com/competitions/cafa-5-protein-function-prediction, 2023.
- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.