Update README.md
README.md
CHANGED
@@ -20,17 +20,17 @@ This exists for comparison to [BEE-spoke-data/wordpiece-tokenizer-32k-en_code-ms
 
 where `bert-base-uncased`'s tokenizer is the base tokenizer:
 
-Total tokens in base tokenizer: 30527
-Total tokens in retrained tokenizer: 31999
-Number of common tokens: 19535
-Tokens unique to base tokenizer: 10992
+Total tokens in base tokenizer: 30527
+Total tokens in retrained tokenizer: 31999
+Number of common tokens: 19535
+Tokens unique to base tokenizer: 10992
 Tokens unique to retrained tokenizer: 12464
-
-Example common tokens:
+
+Example common tokens:
 `['##cts', 'accounted', '##rik', 'fairness', 'music', 'dragons', 'manga', 'vermont', 'matters', '##iting']`
 
-Example tokens unique to base:
+Example tokens unique to base:
 `['federer', 'caucasian', 'remade', '##დ', '[unused134]', 'downfall', 'sahib', '[unused225]', '##ngo', '[unused684]', 'scared', '##gated', 'grinned', 'slick', 'bahn', '##〉', '##reus', 'ufo', 'gathers', 'bayern']`
 
-Example tokens unique to retrained:
+Example tokens unique to retrained:
 `['odot', '##dx', 'mathscr', '##517', 'matplotlib', 'cruc', 'tlie', '##osl', 'qg', 'oc', 'sach', '##colsep', '479', 'conclud', 'iniqu', '##ahan', 'pn', 'foref', 'rapidity', 'faraday']`
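
The counts in this README are consistent with a plain set comparison of the two vocabularies: the common tokens plus the tokens unique to each side add up to each total. The sketch below is one way such statistics could be computed. It is not confirmed to be the script the author used. `compare_vocabs` and the toy vocabularies are hypothetical names introduced here for illustration.

```python
def compare_vocabs(base_vocab, retrained_vocab):
    """Return overlap statistics for two collections of token strings."""
    base, retrained = set(base_vocab), set(retrained_vocab)
    return {
        "base_total": len(base),
        "retrained_total": len(retrained),
        "common": len(base & retrained),
        "unique_to_base": len(base - retrained),
        "unique_to_retrained": len(retrained - base),
    }


# Toy vocabularies (hypothetical; the real ones hold roughly 30k tokens each).
toy_base = ["music", "##cts", "federer", "ufo"]
toy_retrained = ["music", "##cts", "matplotlib", "odot", "faraday"]
stats = compare_vocabs(toy_base, toy_retrained)
print(stats)

# Consistency check on the counts reported in the README:
# common + unique must equal each tokenizer's total.
assert 19535 + 10992 == 30527  # base tokenizer
assert 19535 + 12464 == 31999  # retrained tokenizer
```

With the `transformers` library installed, the real vocabularies could be passed in as `AutoTokenizer.from_pretrained(name).get_vocab()`, which returns a token-to-id dict. Iterating over that dict yields the token strings, so it works with `set(...)` as used above.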