---
license: mit
tags:
- protein language model
datasets:
- Uniref50
---

# DistilProtBert

A distilled version of the [ProtBert-UniRef100](https://huggingface.co/Rostlab/prot_bert) model.
In addition to the cross-entropy and cosine teacher-student losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective. It works only with capital-letter amino acids.

Check out our paper [DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts](https://doi.org/10.1093/bioinformatics/btac474) for more details.

[Git](https://github.com/yarongef/DistilProtBert) repository.

# Model details
|    **Model**   | **# of parameters** | **# of hidden layers** | **Pretraining dataset** | **# of proteins** | **Pretraining hardware** |
|:--------------:|:-------------------:|:----------------------:|:-----------------------:|:------------------------------:|:------------------------:|
|    ProtBert    |         420M        |           30           |        UniRef100        |              216M              |       512 16GB TPUs      |
| DistilProtBert |         230M        |           15           |         UniRef50        |               43M              |     5 V100 32GB GPUs     |

## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.

### How to use

The model can be used in the same way as ProtBert, together with ProtBert's tokenizer.
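
Below is a minimal usage sketch with the Hugging Face Transformers `fill-mask` pipeline; the `yarongef/DistilProtBert` model ID and the example sequence are illustrative assumptions, not taken from the authors' reference code.

```python
# Minimal sketch: masked-residue prediction with the fill-mask pipeline.
# The model ID "yarongef/DistilProtBert" and the input sequence are assumptions.
from transformers import BertForMaskedLM, BertTokenizer, pipeline

tokenizer = BertTokenizer.from_pretrained("yarongef/DistilProtBert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("yarongef/DistilProtBert")

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# As with ProtBert: uppercase amino acids separated by single spaces.
print(unmasker("D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T"))
```

For feature extraction, the same tokenizer can be paired with `BertModel` to obtain per-residue embeddings.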

## Training data

The DistilProtBert model was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences of 20 to 512 amino acids in length were used).
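
As a purely hypothetical illustration of the length filter mentioned above (the authors' actual preprocessing may differ):

```python
# Hypothetical sketch of the sequence-length filter described above;
# not the authors' preprocessing script.
def keep_sequence(seq: str) -> bool:
    """Keep only sequences of 20 to 512 amino acids."""
    return 20 <= len(seq) <= 512

uniref50_sequences = ["MKTAYIAKQR" * 5, "MKL", "A" * 600]   # placeholder sequences
pretraining_set = [s for s in uniref50_sequences if keep_sequence(s)]
```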

# Pretraining procedure

Preprocessing was done using ProtBert's tokenizer.
The masking procedure for each sequence followed the original BERT procedure (as described for [ProtBert](https://huggingface.co/Rostlab/prot_bert)).

The model was pretrained on a single DGX cluster for 3 epochs in total. The local batch size was 16, the optimizer was AdamW with a learning rate of 5e-5, and mixed-precision training was used.
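
As a rough, non-authoritative sketch, these hyperparameters map onto a Hugging Face `TrainingArguments` configuration as follows (the output directory is an assumption, and the distillation losses and data pipeline are omitted):

```python
# Rough sketch only: the stated hyperparameters expressed as TrainingArguments.
# Not the authors' training script; distillation losses and data loading omitted.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilprotbert-pretraining",  # assumed output path
    num_train_epochs=3,                       # 3 epochs in total
    per_device_train_batch_size=16,           # local batch size of 16
    learning_rate=5e-5,                       # AdamW is the Trainer default optimizer
    fp16=True,                                # mixed-precision training (requires a GPU)
)
```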

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

| Dataset | Secondary structure, 3-state (accuracy %) | Membrane (accuracy %) |
|:-------:|:-----------------------------------------:|:---------------------:|
|  CASP12 | 72 |    |
|  TS115  | 81 |    |
|  CB513  | 79 |    |
| DeepLoc |    | 86 |

Distinguishing between real proteins and their k-let shuffled versions (AUC):

_Singlet_ ([dataset](https://huggingface.co/datasets/yarongef/human_proteome_singlets))

|    Model   | AUC |
|:--------------:|:-------:|
|      LSTM      |   0.71  |
|    ProtBert    |   0.93  |
| DistilProtBert |   0.92  |

_Doublet_ ([dataset](https://huggingface.co/datasets/yarongef/human_proteome_doublets))

|    Model   | AUC |
|:--------------:|:-------:|
|      LSTM      |   0.68  |
|    ProtBert    |   0.92  |
| DistilProtBert |   0.91  |

_Triplet_ ([dataset](https://huggingface.co/datasets/yarongef/human_proteome_triplets))

|    Model   | AUC |
|:--------------:|:-------:|
|      LSTM      |   0.61  |
|    ProtBert    |   0.92  |
| DistilProtBert |   0.87  |
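
For intuition, a minimal hypothetical sketch of a singlet shuffle is shown below; it preserves only amino-acid composition, whereas doublet and triplet (k-let) shuffles additionally preserve k-mer counts and require a dedicated shuffling algorithm, which is not shown here.

```python
# Hypothetical illustration only: a singlet shuffle preserves amino-acid
# composition but not k-mer counts. Doublet/triplet (k-let) shuffles need a
# dedicated k-let-preserving algorithm and are not shown here.
import random

def singlet_shuffle(seq: str, seed: int = 0) -> str:
    """Return a composition-preserving random permutation of `seq`."""
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

real_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder sequence
print(singlet_shuffle(real_protein))
```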

## Citation
If you use this model, please cite our paper:
```
@article{geffen2022distilprotbert,
	author = {Geffen, Yaron and Ofran, Yanay and Unger, Ron},
	title = {DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts},
	year = {2022},
	doi = {10.1093/bioinformatics/btac474},
	URL = {https://doi.org/10.1093/bioinformatics/btac474},
	journal = {Bioinformatics}
}
```