Pretraining on Phosphosites and Their MSAs with an MLM Objective on the ESM-1b Architecture

This repository presents a model with the ESM-1b architecture whose weights were initialized from scratch and pretrained using the Masked Language Modeling (MLM) objective. The training data consists of labeled phosphosites derived from DARKIN and their Multiple Sequence Alignments (MSAs).
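For reference, the sketch below shows one way to load the checkpoint and query it for a masked residue. It assumes the checkpoint follows the standard Hugging Face transformers EsmForMaskedLM layout; the peptide and masked position are illustrative only.

```python
# Minimal usage sketch (assumption: standard transformers EsmForMaskedLM layout;
# the peptide and masked position are illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "isikz/esm1b_msa_mlm_pt_phosphosite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Example phosphosite-centered peptide (illustrative, not from the dataset).
peptide = "GSMSRSLSVEAKKE"
inputs = tokenizer(peptide, return_tensors="pt", truncation=True, max_length=128)

# Mask one residue and recover the model's prediction for it.
masked_ids = inputs["input_ids"].clone()
masked_ids[0, 7] = tokenizer.mask_token_id
with torch.no_grad():
    logits = model(input_ids=masked_ids, attention_mask=inputs["attention_mask"]).logits

predicted_id = int(logits[0, 7].argmax(-1))
print(tokenizer.convert_ids_to_tokens(predicted_id))
```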

Developed by:

Zeynep Işık (MSc, Sabanci University)

Training Details

Architecture: ESM-1b (trained from scratch)
Pretraining Objective: Masked Language Modeling (MLM)
Dataset: Labeled phosphosites from DARKIN and their MSAs
Total Samples: 702,468 (10% separated for validation)
Sequence Length: ≤ 128 residues
Batch Size: 64
Optimizer: AdamW
Learning Rate: default
Model Size: 652M parameters (F32)
Training Duration: 3.5 days
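The setup above can be approximated with the standard transformers MLM pipeline. The sketch below is illustrative and not the original training script: the base tokenizer/config ID, masking probability, and schedule are assumptions, and dataset preparation (tokenizing phosphosite and MSA sequences to at most 128 residues) is omitted.

```python
# Illustrative reconstruction of the pretraining setup with the Trainer API.
# Not the original training script; see the assumptions noted in the comments.
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EsmConfig,
    EsmForMaskedLM,
    Trainer,
    TrainingArguments,
)

base_id = "facebook/esm1b_t33_650M_UR50S"  # assumed source of the ESM-1b config
tokenizer = AutoTokenizer.from_pretrained(base_id)
config = EsmConfig.from_pretrained(base_id)
model = EsmForMaskedLM(config)  # weights initialized from scratch, as stated in the card

# Standard MLM masking; 15% is the usual default and an assumption here.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="esm1b_msa_mlm_pt_phosphosite",
    per_device_train_batch_size=64,  # batch size from the card
    # AdamW with the default learning rate matches the Trainer defaults.
)

# train_dataset / eval_dataset: tokenized DARKIN phosphosites + MSAs,
# with 10% held out for validation (not shown here).
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
)
# trainer.train()
```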

Pretraining Performance

Perplexity at Start: 12.32
Perplexity at End: 1.44

A significant decrease in perplexity indicates that the model has effectively learned meaningful representations of phosphosite-related sequences.
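For context, these perplexity values are the exponential of the mean masked-token cross-entropy loss. The helper below only illustrates that relationship; the loss values are back-computed from the reported perplexities, not separately reported numbers.

```python
# Perplexity is exp(mean masked-token cross-entropy loss). The loss values
# below are back-computed from the reported perplexities for illustration.
import math

def perplexity_from_loss(mean_mlm_loss: float) -> float:
    """Convert an average masked-LM cross-entropy loss to perplexity."""
    return math.exp(mean_mlm_loss)

print(round(perplexity_from_loss(2.51), 1))   # ~12.3, roughly the starting perplexity
print(round(perplexity_from_loss(0.365), 2))  # ~1.44, the final perplexity
```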

Potential Use Cases

This pretrained model can be used for downstream tasks requiring phosphosite knowledge, such as:

✅ Binary classification of phosphosites
✅ Kinase-specific phosphorylation site prediction
✅ Protein-protein interaction prediction involving phosphosites
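As a sketch of the first use case, the checkpoint can be loaded with a randomly initialized sequence-classification head and then fine-tuned on labeled phosphosites. The snippet below shows one plausible wiring with transformers; the label layout and example peptide are assumptions, and no fine-tuned classifier is shipped with this repository.

```python
# Sketch: reuse the pretrained encoder for binary phosphosite classification.
# The classification head is randomly initialized and must be fine-tuned;
# the label layout and example peptide are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "isikz/esm1b_msa_mlm_pt_phosphosite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,  # e.g. 0 = not phosphorylated, 1 = phosphorylated
)

peptide = "SLEEDSDSEEKPAKA"  # illustrative phosphosite-centered fragment
inputs = tokenizer(peptide, return_tensors="pt", truncation=True, max_length=128)
probs = classifier(**inputs).logits.softmax(dim=-1)
print(probs)  # untrained head: outputs are meaningless until fine-tuning
```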
