---
license: apache-2.0
metrics:
- accuracy
- f1
base_model:
- facebook/esm1b_t33_650M_UR50S
---

## **Fine-Tuning ESM-1b for Phosphosite Prediction**

This repository provides a fine-tuned version of the [ESM-1b](https://huggingface.co/facebook/esm1b_t33_650M_UR50S) model, trained to classify phosphosites using unlabeled phosphosites (i.e., the kinases that phosphorylate them are unknown) from [PhosphoSitePlus](https://www.phosphosite.org/staticDownloads). The model performs binary classification, distinguishing phosphosites from non-phosphorylated peptide sequences [(Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites)](https://www.sciencedirect.com/science/article/pii/S1535947620311518).

### **Developed by:**

Zeynep Işık (MSc, Sabanci University)

### **Dataset & Labeling Strategy**

The dataset was constructed using phosphosite information from **PhosphoSitePlus**, with the following assumptions:

- **Positive samples:** Known phosphorylated residues from PhosphoSitePlus.
- **Negative samples:** 15-residue sequences from the same proteins, selected so that the central residue matches a known phosphorylation site type but is not reported as phosphorylated in PhosphoSitePlus.

**Note**: The absence of a phosphorylation report does not imply absolute non-phosphorylation; such sites are assumed negative in this study.

### **Dataset Statistics**

- Positive samples: 366,028
- Negative samples: 364,121
- Training samples: 511,104
- Validation samples: 109,522
- Test samples: 109,523

### **Test Performance**

- Accuracy: 0.94
- F1-score: 0.94

### **Usage**

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "isikz/phosphorylation_binaryclassification_esm1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example sequence
sequence = "MKTLLLTLVVVTIVCLDLGYTGV"

# Tokenize input
inputs = tokenizer(sequence, return_tensors="pt")

# Get prediction
with torch.no_grad():
    logits = model(**inputs).logits

# A single-logit head is scored with a sigmoid; a two-logit head with a
# softmax over the two classes. This handles both cases.
if logits.shape[-1] == 1:
    prediction = torch.sigmoid(logits).item()
else:
    prediction = torch.softmax(logits, dim=-1)[0, 1].item()

print(f"Phosphorylation Probability: {prediction:.4f}")
```
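Since training examples are 15-residue windows centered on the candidate residue, inference inputs should follow the same format. Below is a minimal, illustrative sketch (not part of this repository) of scanning a protein for candidate S/T/Y sites and scoring each window; the `extract_windows` helper and the `X` padding convention at sequence ends are assumptions and should be matched to the actual training preprocessing.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "isikz/phosphorylation_binaryclassification_esm1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def extract_windows(protein, half_window=7, residues="STY"):
    """Yield (position, 15-residue window) pairs centered on candidate sites.
    Windows near the sequence ends are padded with 'X' (an assumption;
    match whatever padding was used during training)."""
    padded = "X" * half_window + protein + "X" * half_window
    for i, aa in enumerate(protein):
        if aa in residues:
            yield i, padded[i : i + 2 * half_window + 1]

protein = "MKTLLLTLVVVTIVCLDLGYTGVSSTYK"  # toy sequence
positions, windows = zip(*extract_windows(protein))

# Batch-tokenize all candidate windows and score them in one pass
inputs = tokenizer(list(windows), return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

# One- vs two-logit head handling, as in the Usage section above
probs = (torch.sigmoid(logits).squeeze(-1) if logits.shape[-1] == 1
         else torch.softmax(logits, dim=-1)[:, 1])

for pos, p in zip(positions, probs.tolist()):
    print(f"{protein[pos]}{pos + 1}: {p:.4f}")
```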
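The reported test metrics can be reproduced on a held-out labeled set with standard scikit-learn functions. The snippet below is a sketch with toy data; the 0.5 decision threshold is an assumption, and `labels`/`probs` should come from real held-out windows and model outputs.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy example: ground-truth labels (1 = phosphorylated, 0 = assumed negative)
# and model probabilities. Replace with real held-out data.
labels = [1, 0, 1, 1, 0]
probs = [0.93, 0.12, 0.41, 0.88, 0.07]
preds = [int(p >= 0.5) for p in probs]  # 0.5 threshold is an assumption

print(f"Accuracy: {accuracy_score(labels, preds):.2f}")
print(f"F1-score: {f1_score(labels, preds):.2f}")
```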