---
license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
---

# DistilProtBert model
A distilled version of the [ProtBert](https://huggingface.co/Rostlab/prot_bert) model.
In addition to the cross-entropy and cosine teacher-student distillation losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective. It works only with uppercase amino acid letters.
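As a rough sketch of how the three training objectives named above can be combined (the function, loss weights, and temperature below are illustrative assumptions, not the authors' actual training code):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden, labels,
                      temperature=2.0, alpha_ce=0.5, alpha_mlm=0.3, alpha_cos=0.2):
    """Illustrative combination of the three losses; weights and temperature are assumptions."""
    # Soft-target cross entropy: match the student's token distribution to the teacher's.
    ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Standard MLM cross entropy on the masked positions (labels are -100 elsewhere).
    mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Cosine loss aligning student and teacher hidden states.
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    cos = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    return alpha_ce * ce + alpha_mlm * mlm + alpha_cos * cos
```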
## Model description
DistilProtBert was pretrained on millions of protein sequences.
A few important differences between the DistilProtBert model and the original ProtBert version are:
1. The size of the model
2. The size of the pretraining dataset
3. Time and hardware used for pretraining
## Intended uses & limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
### How to use
The model can be used in the same way as ProtBert; a short example follows.
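A minimal feature-extraction sketch, mirroring ProtBert's usage. The repository ID `yarongef/DistilProtBert` is an assumption not stated in this card; replace it with the actual model path if different. Input sequences are uppercase amino acids separated by spaces, as required by the ProtBert tokenizer.

```python
from transformers import BertModel, BertTokenizer, pipeline

model_name = "yarongef/DistilProtBert"  # assumed repository ID

# ProtBert-style input: uppercase amino acids separated by spaces.
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name)

fe = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
sequence = "M K T A Y I A K Q R Q I S F V K S H F S R Q L E E R"
embedding = fe(sequence)  # per-token hidden states for the sequence
```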
## Training data
DistilProtBert was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences of 20 to 512 amino acids in length were used).
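A minimal sketch of the length filter described above, using Biopython; the FASTA file name is hypothetical.

```python
from Bio import SeqIO

# Keep only sequences of 20 to 512 amino acids, as described above.
kept = [
    str(record.seq)
    for record in SeqIO.parse("uniref50.fasta", "fasta")  # hypothetical file name
    if 20 <= len(record.seq) <= 512
]
```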
## Pretraining procedure
Preprocessing was done using ProtBert's tokenizer.
The masking procedure for each sequence followed the original BERT setup (as described for [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
The model was pretrained on a single DGX cluster for 3 epochs in total, with a local batch size of 16. The optimizer was AdamW with a learning rate of 5e-5 and mixed-precision settings.
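A sketch of a matching Hugging Face `Trainer` setup for the plain MLM part only (the distillation losses described earlier are not included). The output directory and anything not stated in this card are assumptions, and `student_model` / `tokenized_dataset` are placeholders.

```python
from transformers import (BertTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# BERT-style masking of 15% of the tokens, as in ProtBert.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters stated in the card: 3 epochs, local batch size 16,
# AdamW (the Trainer default) at a learning rate of 5e-5, mixed precision.
training_args = TrainingArguments(
    output_dir="distilprotbert-pretraining",  # assumed
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    fp16=True,
)

# trainer = Trainer(model=student_model, args=training_args,
#                   train_dataset=tokenized_dataset, data_collator=data_collator)
# trainer.train()
```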
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
| Task/Dataset | Secondary structure (3-state) | Membrane |
|:-----:|:-----:|:-----:|
| CASP12 | 72 | |
| TS115 | 81 | |
| CB513 | 79 | |
| DeepLoc | | 86 |
Distinguish between real proteins and their shuffled versions:
### BibTeX entry and citation info