Pretraining on Phosphosites and Their MSAs with an MLM Objective on the ESM-1b Architecture

This repository presents a model with the ESM-1b architecture whose weights were initialized from scratch and pretrained using the Masked Language Modeling (MLM) objective. The training data consists of labeled phosphosites derived from DARKIN and their Multiple Sequence Alignments (MSAs).
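For reference, the sketch below shows one way to load the checkpoint and query it for a masked residue. It assumes the checkpoint follows the standard Hugging Face transformers EsmForMaskedLM layout; the peptide and masked position are illustrative only.

```python
# Minimal usage sketch (assumption: standard transformers EsmForMaskedLM layout;
# the peptide and masked position are illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "isikz/esm1b_msa_mlm_pt_phosphosite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Example phosphosite-centered peptide (illustrative, not from the dataset).
peptide = "GSMSRSLSVEAKKE"
inputs = tokenizer(peptide, return_tensors="pt", truncation=True, max_length=128)

# Mask one residue and recover the model's prediction for it.
masked_ids = inputs["input_ids"].clone()
masked_ids[0, 7] = tokenizer.mask_token_id
with torch.no_grad():
    logits = model(input_ids=masked_ids, attention_mask=inputs["attention_mask"]).logits

predicted_id = int(logits[0, 7].argmax(-1))
print(tokenizer.convert_ids_to_tokens(predicted_id))
```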

Developed by:

Zeynep Işık (MSc, Sabanci University)

Training Details

Architecture: ESM-1b (trained from scratch)
Pretraining Objective: Masked Language Modeling (MLM)
Dataset: Labeled phosphosites from DARKIN and their MSAs
Total Samples: 702,468 (10% separated for validation)
Sequence Length: ≤ 128 residues
Batch Size: 64
Optimizer: AdamW
Learning Rate: default
Model Size: 652M parameters (F32)
Training Duration: 3.5 days
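The setup above can be approximated with the standard transformers MLM pipeline. The sketch below is illustrative and not the original training script: the base tokenizer/config ID, masking probability, and schedule are assumptions, and dataset preparation (tokenizing phosphosite and MSA sequences to at most 128 residues) is omitted.

```python
# Illustrative reconstruction of the pretraining setup with the Trainer API.
# Not the original training script; see the assumptions noted in the comments.
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EsmConfig,
    EsmForMaskedLM,
    Trainer,
    TrainingArguments,
)

base_id = "facebook/esm1b_t33_650M_UR50S"  # assumed source of the ESM-1b config
tokenizer = AutoTokenizer.from_pretrained(base_id)
config = EsmConfig.from_pretrained(base_id)
model = EsmForMaskedLM(config)  # weights initialized from scratch, as stated in the card

# Standard MLM masking; 15% is the usual default and an assumption here.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="esm1b_msa_mlm_pt_phosphosite",
    per_device_train_batch_size=64,  # batch size from the card
    # AdamW with the default learning rate matches the Trainer defaults.
)

# train_dataset / eval_dataset: tokenized DARKIN phosphosites + MSAs,
# with 10% held out for validation (not shown here).
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
)
# trainer.train()
```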

Pretraining Performance

Perplexity at Start: 12.32
Perplexity at End: 1.44

A significant decrease in perplexity indicates that the model has effectively learned meaningful representations of phosphosite-related sequences.
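For context, these perplexity values are the exponential of the mean masked-token cross-entropy loss. The helper below only illustrates that relationship; the loss values are back-computed from the reported perplexities, not separately reported numbers.

```python
# Perplexity is exp(mean masked-token cross-entropy loss). The loss values
# below are back-computed from the reported perplexities for illustration.
import math

def perplexity_from_loss(mean_mlm_loss: float) -> float:
    """Convert an average masked-LM cross-entropy loss to perplexity."""
    return math.exp(mean_mlm_loss)

print(round(perplexity_from_loss(2.51), 1))   # ~12.3, roughly the starting perplexity
print(round(perplexity_from_loss(0.365), 2))  # ~1.44, the final perplexity
```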

Potential Use Cases

This pretrained model can be used for downstream tasks requiring phosphosite knowledge, such as:

✅ Binary classification of phosphosites
✅ Kinase-specific phosphorylation site prediction
✅ Protein-protein interaction prediction involving phosphosites
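As a sketch of the first use case, the checkpoint can be loaded with a randomly initialized sequence-classification head and then fine-tuned on labeled phosphosites. The snippet below shows one plausible wiring with transformers; the label layout and example peptide are assumptions, and no fine-tuned classifier is shipped with this repository.

```python
# Sketch: reuse the pretrained encoder for binary phosphosite classification.
# The classification head is randomly initialized and must be fine-tuned;
# the label layout and example peptide are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "isikz/esm1b_msa_mlm_pt_phosphosite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,  # e.g. 0 = not phosphorylated, 1 = phosphorylated
)

peptide = "SLEEDSDSEEKPAKA"  # illustrative phosphosite-centered fragment
inputs = tokenizer(peptide, return_tensors="pt", truncation=True, max_length=128)
probs = classifier(**inputs).logits.softmax(dim=-1)
print(probs)  # untrained head: outputs are meaningless until fine-tuning
```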
