---
license: apache-2.0
metrics:
- accuracy
- f1
base_model:
- facebook/esm1b_t33_650M_UR50S
---

## **Fine-Tuning ESM-1b for Phosphosite Prediction**

This repository provides a fine-tuned version of the [ESM-1b](https://huggingface.co/facebook/esm1b_t33_650M_UR50S) model, trained to classify phosphosites using unlabeled phosphosites (i.e., the kinases that phosphorylate them are unknown) from [PhosphoSitePlus](https://www.phosphosite.org/staticDownloads). The model performs binary classification, distinguishing phosphosites from non-phosphorylated peptide sequences [(Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites)](https://www.sciencedirect.com/science/article/pii/S1535947620311518).

### **Developed by:**

Zeynep Işık (MSc, Sabanci University)

### **Dataset & Labeling Strategy**

The dataset was constructed using phosphosite information from **PhosphoSitePlus**, with the following assumptions:

- **Positive samples:** Known phosphorylated residues from PhosphoSitePlus.
- **Negative samples:** 15-residue sequences from the same proteins, selected so that the central residue matches a known phosphorylation site type but is not reported as phosphorylated in PhosphoSitePlus.

**Note**: The absence of a phosphorylation report does not imply absolute non-phosphorylation; such sites are assumed negative in this study.

### **Dataset Statistics**

- Positive samples: 366,028
- Negative samples: 364,121
- Training samples: 511,104
- Validation samples: 109,522
- Test samples: 109,523

### **Test Performance**

- Accuracy: 0.94
- F1-score: 0.94

### **Usage**

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "isikz/phosphorylation_binaryclassification_esm1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example sequence
sequence = "MKTLLLTLVVVTIVCLDLGYTGV"

# Tokenize input
inputs = tokenizer(sequence, return_tensors="pt")

# Get prediction
with torch.no_grad():
    logits = model(**inputs).logits

# A single-logit head is scored with a sigmoid; a two-logit head with a
# softmax over the two classes. This handles both cases.
if logits.shape[-1] == 1:
    prediction = torch.sigmoid(logits).item()
else:
    prediction = torch.softmax(logits, dim=-1)[0, 1].item()

print(f"Phosphorylation Probability: {prediction:.4f}")
```
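Since training examples are 15-residue windows centered on the candidate residue, inference inputs should follow the same format. Below is a minimal, illustrative sketch (not part of this repository) of scanning a protein for candidate S/T/Y sites and scoring each window; the `extract_windows` helper and the `X` padding convention at sequence ends are assumptions and should be matched to the actual training preprocessing.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "isikz/phosphorylation_binaryclassification_esm1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def extract_windows(protein, half_window=7, residues="STY"):
    """Yield (position, 15-residue window) pairs centered on candidate sites.
    Windows near the sequence ends are padded with 'X' (an assumption;
    match whatever padding was used during training)."""
    padded = "X" * half_window + protein + "X" * half_window
    for i, aa in enumerate(protein):
        if aa in residues:
            yield i, padded[i : i + 2 * half_window + 1]

protein = "MKTLLLTLVVVTIVCLDLGYTGVSSTYK"  # toy sequence
positions, windows = zip(*extract_windows(protein))

# Batch-tokenize all candidate windows and score them in one pass
inputs = tokenizer(list(windows), return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

# One- vs two-logit head handling, as in the Usage section above
probs = (torch.sigmoid(logits).squeeze(-1) if logits.shape[-1] == 1
         else torch.softmax(logits, dim=-1)[:, 1])

for pos, p in zip(positions, probs.tolist()):
    print(f"{protein[pos]}{pos + 1}: {p:.4f}")
```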
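The reported test metrics can be reproduced on a held-out labeled set with standard scikit-learn functions. The snippet below is a sketch with toy data; the 0.5 decision threshold is an assumption, and `labels`/`probs` should come from real held-out windows and model outputs.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy example: ground-truth labels (1 = phosphorylated, 0 = assumed negative)
# and model probabilities. Replace with real held-out data.
labels = [1, 0, 1, 1, 0]
probs = [0.93, 0.12, 0.41, 0.88, 0.07]
preds = [int(p >= 0.5) for p in probs]  # 0.5 threshold is an assumption

print(f"Accuracy: {accuracy_score(labels, preds):.2f}")
print(f"F1-score: {f1_score(labels, preds):.2f}")
```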