---
library_name: transformers
tags:
- gene-ontology
- proteomics
datasets:
- andrewdalpino/CAFA5
metrics:
- precision
- recall
- f1
base_model:
- facebook/esm2_t30_150M_UR50D
pipeline_tag: text-classification
---

# ESM2 Protein Function Caller

An Evolutionary-scale Model (ESM) for protein function calling from amino acid sequences. Based on the ESM2 Transformer architecture and fine-tuned on the [CAFA 5](https://huggingface.co/datasets/andrewdalpino/CAFA5) dataset, this model predicts the gene ontology (GO) subgraph for a given protein sequence, giving you insight into its molecular function, the biological processes it takes part in, and where in the cell its activity occurs.

**Note**: This model specializes in the `cellular component` subgraph of the gene ontology.

## Code Repository

https://github.com/andrewdalpino/esm2-function-classifier

## Model Specs

- **Vocabulary Size**: 33
- **Embedding Dimensions**: 640
- **Attention Heads**: 20
- **Hidden Layers**: 30
- **Context Length**: 1026
- **Total Parameters**: 151M
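
Two quantities follow directly from these specs and can be useful when preparing inputs. The short sketch below derives the width of each attention head and the longest sequence that fits in the context window. It assumes the tokenizer adds one start token and one end token to every sequence, which is the usual ESM2 behavior but is not stated in this card.

```python
# Values copied from the Model Specs list above.
embedding_dimensions = 640
attention_heads = 20
context_length = 1026

# Each attention head operates on an equal slice of the embedding.
head_dimensions = embedding_dimensions // attention_heads

# Assumption: the tokenizer adds one start and one end special token,
# so two context positions are not available for amino acids.
num_special_tokens = 2
max_residues = context_length - num_special_tokens

print(f"Dimensions per head: {head_dimensions}")
print(f"Max residues before truncation: {max_residues}")
```

Under that assumption, sequences longer than `max_residues` amino acids would be cut off when `truncation=True` is passed to the tokenizer.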

## Example Usage

```python
import torch

from transformers import EsmTokenizer, EsmForSequenceClassification

model_name = "andrewdalpino/ESM2-150M-Protein-Cellular-Component"

tokenizer = EsmTokenizer.from_pretrained(model_name)

model = EsmForSequenceClassification.from_pretrained(model_name)

model.eval()

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMMGKKWQMPMCSLH"

top_k = 10

out = tokenizer(
    sequence,
    max_length=1026,
    truncation=True,
)

input_ids = out["input_ids"]

input_ids = torch.tensor(input_ids, dtype=torch.int64).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_ids)

    probabilities = torch.sigmoid(outputs.logits.squeeze(0))

    probabilities, indices = torch.topk(probabilities, top_k)

probabilities = probabilities.tolist()

terms = [model.config.id2label[index] for index in indices.tolist()]

print(f"Top {top_k} GO Terms:")

for term, probability in zip(terms, probabilities):
    print(f"{probability:.4f}: {term}")
```
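
Because each GO term gets its own sigmoid probability, a protein can be assigned any number of terms. Instead of a fixed top-k, you may prefer to keep every term whose probability clears a cutoff. The following is a minimal sketch of that idea in plain Python. The logits, the label mapping, the `select_terms` helper, and the 0.5 threshold are all hypothetical values chosen for illustration; in practice the logits would come from `outputs.logits` and the labels from `model.config.id2label`.

```python
import math

# Hypothetical logits and label names, for illustration only.
logits = [2.2, -1.3, 0.4, -3.0]
id2label = {0: "GO:0005737", 1: "GO:0005634", 2: "GO:0016020", 3: "GO:0005576"}


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_terms(logits, id2label, threshold=0.5):
    """Return (term, probability) pairs at or above the threshold, highest first."""
    scored = [(id2label[index], sigmoid(logit)) for index, logit in enumerate(logits)]
    selected = [(term, probability) for term, probability in scored if probability >= threshold]

    return sorted(selected, key=lambda pair: pair[1], reverse=True)


for term, probability in select_terms(logits, id2label):
    print(f"{probability:.4f}: {term}")
```

Raising the threshold trades recall for precision, so the right cutoff depends on how costly a false positive is for your application.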

## Training Results

- **Epochs**: 20
- **Test F1**: 0.63
- **Test Precision**: 0.78
- **Test Recall**: 0.53
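
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall. This sketch assumes the reported F1 is the harmonic mean of those two aggregate figures; the card does not say how the metrics were averaged, so treat it as a sanity check only.

```python
# Reported test metrics from the list above.
precision = 0.78
recall = 0.53

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Recomputed F1: {f1:.2f}")
```

The gap between precision and recall suggests the model is conservative: the terms it predicts are usually correct, but it misses a sizable share of the true annotations.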

## References

>- A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021.
>- Z. Lin, et al. Evolutionary-scale prediction of atomic level protein structure with a language model, 2022.
>- G. A. Merino, et al. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge, 2022.
>- I. Friedberg, et al. CAFA 5 Protein Function Prediction. https://kaggle.com/competitions/cafa-5-protein-function-prediction, 2023.
>- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.