dotan1111 committed
Commit e35952a · 1 Parent(s): 2658d35

Upload 2 files

Files changed (2)
  1. README.md +52 -0
  2. tokenizer.json +430 -0
README.md ADDED
---
tags:
- biology
- bioinformatics
- tokenizers
---

# Effect of Tokenization on Transformers for Biological Sequences

## Abstract:

Deep learning models are transforming biological research. Many bioinformatics and comparative genomics algorithms analyze genomic data, either DNA or protein sequences. Examples include sequence alignments, phylogenetic tree inference and automatic classification of protein functions. Among these deep learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English and French, in which segmentation of the text into separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text into a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA into single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins into specific families. We demonstrate that applying alternative tokenization algorithms can increase accuracy and, at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models while taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a three-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data.

![image](https://github.com/idotan286/BiologicalTokenizers/assets/58917533/d69893e2-7114-41a8-8d46-9b025b2d2840)

Different tokenization algorithms can be applied to biological sequences, as exemplified for the sequence "AAGTCAAGGATC". (a) The baseline "words" tokenizer assumes a dictionary consisting of the nucleotides: "A", "C", "G" and "T". The length of the encoded sequence is 12, i.e., the number of nucleotides; (b) The "pairs" tokenizer assumes a dictionary consisting of all possible nucleotide pairs. The length of the encoded sequence is typically halved; (c) A sophisticated dictionary consisting of only three tokens: "AAG", "TC" and "GA". The encoded sequence for this dictionary contains only five tokens.

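To make the idea concrete, here is a minimal, illustrative sketch of dictionary-based segmentation. The greedy longest-match rule and the `greedy_tokenize` helper are our own simplification for illustration only; they are not the BPE, WordPiece, or Unigram procedures evaluated in the paper.

```python
def greedy_tokenize(seq, vocab):
    """Left-to-right longest-match segmentation (illustrative only)."""
    tokens, i = [], 0
    vocab = sorted(vocab, key=len, reverse=True)  # prefer longer tokens
    while i < len(seq):
        # Take the longest dictionary token starting at position i,
        # falling back to a single character if nothing matches.
        match = next((t for t in vocab if seq.startswith(t, i)), seq[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(greedy_tokenize("AAGTCAAGGATC", ["A", "C", "G", "T"]))  # 12 single-character tokens
print(greedy_tokenize("AAGTCAAGGATC", ["AAG", "TC", "GA"]))   # ['AAG', 'TC', 'AAG', 'GA', 'TC']
```
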
## Data:

The "data" folder contains the train, validation, and test data for seven of the eight datasets used in the paper.

## BFD Tokenizers:

We trained BPE, WordPiece and Unigram tokenizers on samples of proteins from the 2.2 billion protein sequences of the BFD dataset (Steinegger and Söding 2018). We evaluated the average sequence length, in tokens, as a function of the vocabulary size and the number of sequences in the training data.

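As a rough sketch of how such a tokenizer can be trained with the Hugging Face `tokenizers` library: the file name `proteins.txt` (one sequence per line), the vocabulary size, and the output path below are illustrative placeholders, not the paper's actual training setup.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Build an (untrained) Unigram tokenizer with the same normalizer and
# pre-tokenizer as the tokenizer.json in this repository.
tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Illustrative settings: one protein sequence per line in proteins.txt.
trainer = trainers.UnigramTrainer(vocab_size=100,
                                  special_tokens=["<UNK>"],
                                  unk_token="<UNK>")
tokenizer.train(["proteins.txt"], trainer=trainer)
tokenizer.save("my_tokenizer.json")
```
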
![BFD_BPE_table](https://github.com/idotan286/BiologicalTokenizers/assets/58917533/710b7aa7-0dde-46bb-9ddf-39a84b579d71)
![BFD_WPC_table](https://github.com/idotan286/BiologicalTokenizers/assets/58917533/8adfe5a7-25f5-4723-a87a-8598c6a76ff6)
![BFD_UNI_table](https://github.com/idotan286/BiologicalTokenizers/assets/58917533/4462e782-0b21-4377-a5fe-309685141538)

Effect of vocabulary size and number of training samples on the three tokenizers: BPE, WordPiece and Unigram. The darker the color, the higher the average number of tokens per protein. Increasing the vocabulary size and the training-set size reduces the number of tokens per protein for all of the tested tokenizers.

We uploaded the "BFD_Tokenizers", which were trained on 10,000,000 sequences randomly sampled from the BFD dataset.

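A minimal usage sketch for the `tokenizer.json` uploaded in this commit, assuming the Hugging Face `tokenizers` library is installed; the protein sequence below is an arbitrary example.

```python
from tokenizers import Tokenizer

# Load the Unigram tokenizer shipped as tokenizer.json in this repository.
tok = Tokenizer.from_file("tokenizer.json")

# The normalizer lowercases the input and the pre-tokenizer splits on
# whitespace, so a raw protein sequence can be encoded directly.
enc = tok.encode("MKVLLAGGSSDE")  # arbitrary example sequence
print(enc.tokens)       # segmentation into vocabulary tokens
print(len(enc.tokens))  # typically fewer tokens than amino acids
```
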
## GitHub

The code, datasets, and trained tokenizers are available at https://github.com/idotan286/BiologicalTokenizers/.

## APA

```
Dotan, E., Jaschek, G., Pupko, T., & Belinkov, Y. (2023). Effect of Tokenization on Transformers for Biological Sequences. bioRxiv. https://doi.org/10.1101/2023.08.15.553415
```

## BibTeX

```
@article{Dotan_Effect_of_Tokenization_2023,
  author = {Dotan, Edo and Jaschek, Gal and Pupko, Tal and Belinkov, Yonatan},
  doi = {10.1101/2023.08.15.553415},
  journal = {bioRxiv},
  month = aug,
  title = {{Effect of Tokenization on Transformers for Biological Sequences}},
  year = {2023}
}
```
tokenizer.json ADDED
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      "content": "<UNK>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
  ],
  "normalizer": { "type": "Lowercase" },
  "pre_tokenizer": { "type": "Whitespace" },
  "post_processor": null,
  "decoder": null,
  "model": {
    "type": "Unigram",
    "unk_id": 0,
    "vocab": [
      ["<UNK>", 0.0],
      ["i", -3.0203980367124537],
      ["t", -3.062956886044912],
      ["k", -3.082041895330107],
      ["f", -3.103728012092901],
      ["q", -3.135578439066535],
      ["a", -3.1735611832016133],
      ["e", -3.2414619148167],
      ["s", -3.245056006002681],
      ["r", -3.2453138014528733],
      ["d", -3.247259733320952],
      ["g", -3.2735768039326345],
      ["l", -3.2758473315755303],
      ["n", -3.2874450246220093],
      ["p", -3.3366652351080646],
      ["v", -3.4634740370979564],
      ["h", -3.5246874642761234],
      ["y", -3.5423092969956507],
      ["m", -3.7572360683510553],
      ["w", -4.1068255516984],
      ["c", -4.1691497756131],
      ["aa", -4.585326076591901],
      ["la", -4.887530882982123],
      ["al", -4.992849418633071],
      ["ll", -5.029207021549425],
      ["ag", -5.034617504795474],
      ["gg", -5.209981982112646],
      ["rr", -5.219453257678689],
      ["va", -5.284565871717094],
      ["av", -5.309113365418225],
      ["lg", -5.335946291080051],
      ["ar", -5.346487427622183],
      ["ga", -5.347768184015516],
      ["rl", -5.359188586198691],
      ["ra", -5.388899095021582],
      ["lv", -5.415610462404311],
      ["vl", -5.442687278673805],
      ["pa", -5.4566928056054405],
      ["gl", -5.457117261276412],
      ["lr", -5.502800422610287],
      ["vv", -5.5241406506219],
      ["gr", -5.610282811468741],
      ["gv", -5.634492053238818],
      ["ae", -5.636273072018625],
      ["ls", -5.649664173419566],
      ["sg", -5.651066151953925],
      ["vg", -5.699240313159557],
      ["pg", -5.705567860524408],
      ["sl", -5.714505001184756],
      ["sa", -5.72066748094986],
      ["as", -5.726946393087237],
      ["dl", -5.731060965222051],
      ["el", -5.735435604764186],
      ["ss", -5.73815657796286],
      ["da", -5.743365731914071],
      ["rg", -5.748123393774561],
      ["le", -5.7677881761217655],
      ["ia", -5.776944415806298],
      ["ta", -5.77737040425999],
      ["ld", -5.782381548600611],
      ["ea", -5.805221911324413],
      ["tl", -5.815791178543998],
      ["ad", -5.831030225259809],
      ["dg", -5.842558129083246],
      ["lp", -5.842667174329858],
      ["tg", -5.856628213333442],
      ["gs", -5.88330933978234],
      ["rv", -5.887710110137677],
      ["pl", -5.901498352812656],
      ["er", -5.92200381405967],
      ["x", -5.933807364971587],
      ["at", -5.943452569412951],
      ["rs", -5.9468087421559535],
      ["vr", -5.964121749447516],
      ["pv", -5.977135678629862],
      ["ve", -5.980805107024498],
      ["lt", -5.986052614892426],
      ["ap", -5.995057905935505],
      ["ge", -6.010117098429783],
      ["pp", -6.027936858412414],
      ["re", -6.041928176100955],
      ["dv", -6.043940338964116],
      ["tv", -6.059045706661092],
      ["ig", -6.061067377196126],
      ["vs", -6.06266048412966],
      ["gd", -6.09875857638583],
      ["vd", -6.118779472304132],
      ["sv", -6.128847930365273],
      ["rp", -6.138786760719885],
      ["ee", -6.142724936012062],
      ["ps", -6.153316985162256],
      ["de", -6.204250643531047],
      ["il", -6.207346894537773],
      ["ev", -6.242737226311572],
      ["rd", -6.269418524522775],
      ["sp", -6.338806140439619],
      ["u", -17.79582990121073],
      ["b", -17.996473404241875],
      ["z", -19.23728665450807],
      ["o", -20.68728665451499]
    ]
  }
}
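
The "vocab" entries above pair each token with a log probability. A Unigram model segments a sequence by choosing the candidate segmentation whose token log-probabilities sum to the highest score (found with a Viterbi-style search in practice). A minimal sketch of that scoring rule, using values copied from the vocabulary above:

```python
# Log-probabilities copied from the "vocab" list above.
logp = {
    "a": -3.1735611832016133,
    "g": -3.2735768039326345,
    "ag": -5.034617504795474,
    "aa": -4.585326076591901,
    "gg": -5.209981982112646,
}

def score(segmentation):
    """Unigram score of a candidate segmentation: sum of token log-probs."""
    return sum(logp[t] for t in segmentation)

# Three ways to segment the (lowercased) string "aagg":
for seg in (["aa", "gg"], ["a", "ag", "g"], ["a", "a", "g", "g"]):
    print(seg, round(score(seg), 3))
# ["aa", "gg"] has the highest score, so "aagg" would be encoded as two tokens.
```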