---
library_name: transformers
tags:
- gene-ontology
- proteomics
datasets:
- andrewdalpino/CAFA5
metrics:
- precision
- recall
- f1
base_model:
- facebook/esm2_t30_150M_UR50D
pipeline_tag: text-classification
---

# ESM2 Protein Function Caller

An Evolutionary-scale Model (ESM) for protein function calling from amino acid sequences. Based on the ESM2 Transformer architecture and fine-tuned on the [CAFA 5](https://huggingface.co/datasets/andrewdalpino/CAFA5) dataset, this model predicts the gene ontology (GO) subgraph for a given protein sequence, giving you insight into the molecular function, biological process, and location of the activity inside the cell.

**Note**: This model specializes in the `cellular component` subgraph of the gene ontology.

## Code Repository

https://github.com/andrewdalpino/esm2-function-classifier

## Model Specs

- **Vocabulary Size**: 33
- **Embedding Dimensions**: 640
- **Attention Heads**: 20
- **Hidden Layers**: 30
- **Context Length**: 1026
- **Total Parameters**: 151M

## Example Usage

```python
import torch

from transformers import EsmTokenizer, EsmForSequenceClassification

model_name = "andrewdalpino/ESM2-150M-Protein-Cellular-Component"

tokenizer = EsmTokenizer.from_pretrained(model_name)

model = EsmForSequenceClassification.from_pretrained(model_name)

# Switch to inference mode (disables dropout).
model.eval()

sequence = "MCNAWYISVDFEKNREDKSKCIHTRRNSGPKLLEHVMYEVLRDWYCLEGENVYMMGKKWQMPMCSLH"

top_k = 10

# Tokenize the sequence, truncating to the model's context length.
out = tokenizer(
    sequence,
    max_length=1026,
    truncation=True,
)

input_ids = out["input_ids"]

# Add a batch dimension of size 1.
input_ids = torch.tensor(input_ids, dtype=torch.int64).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_ids)

    # One independent probability per GO term (multi-label).
    probabilities = torch.sigmoid(outputs.logits.squeeze(0))

    probabilities, indices = torch.topk(probabilities, top_k)

probabilities = probabilities.tolist()

terms = [model.config.id2label[index] for index in indices.tolist()]

print(f"Top {top_k} GO Terms:")

for term, probability in zip(terms, probabilities):
    print(f"{probability:.4f}: {term}")
```

## Training Results

- **Epochs**: 20
- **Test F1**: 0.63
- **Test Precision**: 0.78
- **Test Recall**: 0.53

## References:

>- A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, 2021.
>- Z. Lin, et al. Evolutionary-scale prediction of atomic level protein structure with a language model, 2022.
>- G. A. Merino, et al. Hierarchical deep learning for predicting GO annotations by integrating protein knowledge, 2022.
>- I. Friedberg, et al. CAFA 5 Protein Function Prediction. https://kaggle.com/competitions/cafa-5-protein-function-prediction, 2023.
>- M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.
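
## Selecting Terms by Threshold

The usage example returns a fixed top-k list, but because each GO term gets its own sigmoid probability, the outputs can also be filtered by a cutoff. The following is a minimal sketch of that idea, not the author's method: the helper names `sigmoid` and `select_terms`, the 0.5 threshold, and the toy logits and label map are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def select_terms(logits, id2label, threshold=0.5):
    """Return (term, probability) pairs at or above the threshold.

    Hypothetical helper: `logits` is a flat list of floats and
    `id2label` maps each position to a GO term identifier.
    """
    selected = []
    for index, logit in enumerate(logits):
        probability = sigmoid(logit)
        if probability >= threshold:
            selected.append((id2label[index], probability))
    # Most confident terms first.
    selected.sort(key=lambda pair: pair[1], reverse=True)
    return selected


# Toy inputs for illustration only; these are not real model outputs.
toy_logits = [2.0, -1.0, 0.5]
toy_id2label = {0: "GO:0005634", 1: "GO:0005737", 2: "GO:0005886"}

for term, probability in select_terms(toy_logits, toy_id2label):
    print(f"{probability:.4f}: {term}")
```

With the real model, the inputs would be `outputs.logits.squeeze(0).tolist()` and `model.config.id2label` from the usage example. Raising the threshold generally trades recall for precision, so the cutoff is worth tuning on held-out data rather than taking 0.5 as given.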