LoRA-DR-suite
Model details
LoRA-DR-suite is a family of models for the identification of disordered regions (DR) in proteins, built upon state-of-the-art Protein Language Models (PLMs) trained on protein sequences only. They leverage Low-Rank Adaptation (LoRA) fine-tuning for binary classification of intrinsic and soft disorder.
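As a rough illustration of this setup, the sketch below shows how LoRA fine-tuning of a PLM for token-level binary classification can be configured with the `peft` library. The base checkpoint, rank, alpha, dropout, and target modules are illustrative placeholders, not the hyper-parameters actually used for LoRA-DR-suite.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification

# Illustrative placeholders: neither the base checkpoint nor the LoRA
# hyper-parameters below are the ones used to train LoRA-DR-suite.
base_model = AutoModelForTokenClassification.from_pretrained(
    "facebook/esm2_t12_35M_UR50D",  # placeholder PLM
    num_labels=2,                   # ordered (0) vs. disordered (1)
)
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,       # per-residue classification
    r=8,                                # rank of the low-rank update
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in ESM-2
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only LoRA matrices + head are trainable
```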
Intrinsically disordered residues are experimentally detected through circular dichroism and X-ray crystallography, while soft disorder is characterized by high B-factors or by residues that are intermittently missing across different X-ray crystal structures of the same sequence.
Models for intrinsic disorder are trained either on DisProt 7.0 data only (DisProt7 suffix) or on additional data from the first and second editions of the Critical Assessment of Intrinsic Disorder (CAID), indicated by the ID suffix.
Models for soft disorder classification are instead trained on the SoftDis dataset, derived from an extensive analysis of clusters of alternative structures of the same protein sequence in the Protein Data Bank (PDB). For each position in the representative sequence of each cluster, SoftDis provides the frequency of closely related homologs in which the corresponding residue is highly flexible or missing.
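Since SoftDis provides continuous per-position frequencies rather than labels, binary classification implies some binarization of the data. The sketch below illustrates the idea with a placeholder 0.5 cutoff; this threshold is an assumption for illustration, not necessarily the one used to train these models.

```python
import numpy as np

# Hypothetical SoftDis entry: for each position of a cluster representative,
# the frequency of close homologs in which that residue is flexible/missing.
freqs = np.array([0.05, 0.10, 0.62, 0.71, 0.30, 0.00, 0.85])

# Placeholder cutoff; the actual binarization used for training is not
# specified here.
THRESHOLD = 0.5
soft_disorder_labels = (freqs >= THRESHOLD).astype(int)
print(soft_disorder_labels)  # -> [0 0 1 1 0 0 1]
```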
Model checkpoints
We provide several model checkpoints, differing in training data and in the pre-trained PLM.
| Checkpoint name | Training dataset | Pre-trained checkpoint |
|---|---|---|
| esm2_650M-LoRA-DisProt7 | DisProt 7.0 | esm2_t33_650M_UR50D |
| esm2_35M-LoRA-DisProt7 | DisProt 7.0 | esm2_t12_35M_UR50D |
| Ankh-LoRA-DisProt7 | DisProt 7.0 | ankh-large |
| PortT5-LoRA-DisProt7 | DisProt 7.0 | prot_t5_xl_uniref50 |
| esm2_650M-LoRA-ID | Intrinsic dis.* | esm2_t33_650M_UR50D |
| esm2_35M-LoRA-ID | Intrinsic dis.* | esm2_t12_35M_UR50D |
| Ankh-LoRA-ID | Intrinsic dis.* | ankh-large |
| PortT5-LoRA-ID | Intrinsic dis.* | prot_t5_xl_uniref50 |
| esm2_650M-LoRA-SD | SoftDis | esm2_t33_650M_UR50D |
| esm2_35M-LoRA-SD | SoftDis | esm2_t12_35M_UR50D |
| Ankh-LoRA-SD | SoftDis | ankh-large |
| PortT5-LoRA-SD | SoftDis | prot_t5_xl_uniref50 |
\* DisProt 7.0, CAID1, and CAID2 data
Intended uses & limitations
The models are intended for per-residue classification of the different disorder types described above.
Models for intrinsic disorder trained on DisProt 7.0 were evaluated on the CAID1 and CAID2 challenges, but we suggest using the ID models for the classification of new sequences, as they show better generalization.
In addition to its relation to flexibility and assembly pathways, soft disorder can be used to infer confidence scores for structure prediction tools: we found a strong negative Spearman correlation between predicted soft disorder probabilities and the pLDDT of AlphaFold2 predictions.
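For illustration, such a comparison can be computed with `scipy.stats.spearmanr`; the two arrays below are made-up placeholders standing in for one protein's predicted soft disorder profile and the per-residue pLDDT of an AlphaFold2 prediction of the same sequence.

```python
import numpy as np
from scipy.stats import spearmanr

# Made-up placeholders: per-residue soft disorder probabilities from an SD
# model and per-residue pLDDT from an AlphaFold2 prediction of the same
# sequence (same length and residue order).
soft_disorder_proba = np.array([0.90, 0.80, 0.20, 0.10, 0.15, 0.70])
plddt = np.array([35.0, 42.0, 88.0, 93.0, 90.0, 50.0])

rho, pval = spearmanr(soft_disorder_proba, plddt)
print(f"Spearman rho = {rho:.2f} (p = {pval:.1e})")  # strongly negative rho
```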
Model usage
All models can be loaded as PyTorch Modules, together with their associated tokenizer, with the following code:
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "CQSB/Ankh-LoRA-ID-DisProt7"  # model_id for the selected model
model = AutoModelForTokenClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
Once the model is loaded, the disorder profile for all residues in a sequence can be obtained as follows:
```python
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# example sequence
sequence = "TAIWEQHTVTLHRAPGFGFGIAISGGRDNPHFQSGETSIVISDVLKG"

# each pre-trained model adds its own special tokens to the tokenized
# sequence; special_tokens_mask lets us deal with them (padding included,
# for batched inputs) without changing the code
inputs = tokenizer(
    [sequence], return_tensors="pt", return_special_tokens_mask=True
)
input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)
special_tokens_mask = inputs["special_tokens_mask"].bool()

# extract the predicted per-residue disorder probability
with torch.inference_mode():
    output = model(input_ids=input_ids, attention_mask=attention_mask).logits.cpu()
output = output[~special_tokens_mask, :]
disorder_proba = F.softmax(output, dim=-1)[:, 1]
```
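As noted in the comments above, `special_tokens_mask` also marks padding tokens, so the same approach extends to batched inputs. The sketch below continues from the snippet above (same `model`, `tokenizer`, and `device`); the second sequence is an arbitrary example, and per-sequence profiles are recovered by indexing each row of the batch.

```python
# batched inference: special_tokens_mask also marks padding tokens
sequences = [
    "TAIWEQHTVTLHRAPGFGFGIAISGGRDNPHFQSGETSIVISDVLKG",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # arbitrary second sequence
]
inputs = tokenizer(
    sequences, return_tensors="pt", padding=True,
    return_special_tokens_mask=True,
)
with torch.inference_mode():
    logits = model(
        input_ids=inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device),
    ).logits.cpu()

probs = F.softmax(logits, dim=-1)[..., 1]
special_tokens_mask = inputs["special_tokens_mask"].bool()
# one disorder profile per sequence, special tokens and padding removed
profiles = [p[~m] for p, m in zip(probs, special_tokens_mask)]
```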
How to cite
Coming soon...