Pretraining on Phosphosites and their MSAs with MLM Objective on ESM-1b Architecture
This repository provides an ESM-1b model whose weights were initialized from scratch and pretrained with the Masked Language Modeling (MLM) objective. The training data consists of labeled phosphosites from the DARKIN dataset and their Multiple Sequence Alignments (MSAs).
Developed by:
Zeynep Işık (MSc, Sabanci University)
Training Details
- Architecture: ESM-1b (trained from scratch)
- Pretraining Objective: Masked Language Modeling (MLM)
- Dataset: Labeled phosphosites from DARKIN and their MSAs
- Total Samples: 702,468 (10% held out for validation)
- Sequence Length: ≤ 128 residues
- Batch Size: 64
- Optimizer: AdamW
- Learning Rate: default
- Training Duration: 3.5 days
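A minimal sketch of this pretraining setup with Hugging Face Transformers and Datasets is shown below. It is not the exact training script: the input file `phosphosite_sequences.txt` and the epoch count are hypothetical, and the DARKIN/MSA preprocessing is omitted. The Trainer defaults (AdamW, learning rate 5e-5) correspond to the "Optimizer: AdamW, Learning Rate: default" entries above.

```python
# Sketch only: from-scratch MLM pretraining of an ESM-1b-style model.
# "phosphosite_sequences.txt" and num_train_epochs are hypothetical placeholders.
from datasets import Dataset
from transformers import (
    AutoTokenizer, EsmConfig, EsmForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")

# Reuse the ESM-1b architecture definition, but initialize the weights from scratch.
config = EsmConfig.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = EsmForMaskedLM(config)

# Hypothetical input: one amino-acid sequence (<= 128 residues) per line.
with open("phosphosite_sequences.txt") as f:
    sequences = [line.strip() for line in f if line.strip()]

dataset = Dataset.from_dict({"sequence": sequences})

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["sequence"])
splits = tokenized.train_test_split(test_size=0.1, seed=42)  # 10% held out for validation

# Standard 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="esm1b_msa_mlm_pt_phosphosite",
    per_device_train_batch_size=64,
    num_train_epochs=3,  # assumption; the card does not state the epoch count
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=collator,
)
trainer.train()
```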
Pretraining Performance
- Perplexity at start: 12.32
- Perplexity at end: 1.44

The significant decrease in perplexity indicates that the model has effectively learned meaningful representations of phosphosite-related sequences.
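For reference, MLM perplexity is the exponential of the mean masked-token cross-entropy on the validation split; a minimal sketch, assuming the `trainer` object from the pretraining sketch above:

```python
import math

# Perplexity = exp(mean masked-token cross-entropy on the validation split).
eval_metrics = trainer.evaluate()  # `trainer` comes from the sketch above
print(f"Validation perplexity: {math.exp(eval_metrics['eval_loss']):.2f}")
```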
Potential Use Cases
This pretrained model can be used for downstream tasks requiring phosphosite knowledge, such as:

✅ Binary classification of phosphosites
✅ Kinase-specific phosphorylation site prediction
✅ Protein-protein interaction prediction involving phosphosites
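As a sketch of how the checkpoint could be plugged into one of these tasks (binary phosphosite classification), the snippet below loads the pretrained encoder with a freshly initialized classification head via Hugging Face Transformers. The example sequence is a placeholder, and the tokenizer is assumed to be available in this repository (otherwise it can be loaded from facebook/esm1b_t33_650M_UR50S).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model id of this repository; the classification head is newly initialized
# and still needs to be fine-tuned on labeled phosphosite data.
model_id = "isikz/esm1b_msa_mlm_pt_phosphosite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Placeholder phosphosite-centered sequence (<= 128 residues).
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # class probabilities (meaningful only after fine-tuning the head)
```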