pszemraj committed · Commit 00fe09e · verified · 1 Parent(s): 1d3750a

Update README.md

Files changed (1): README.md (+8 -8)
@@ -20,17 +20,17 @@ This exists for comparison to [BEE-spoke-data/wordpiece-tokenizer-32k-en_code-ms
 
  where `bert-base-uncased`'s tokenizer is the base tokenizer:
 
- Total tokens in base tokenizer: 30527
- Total tokens in retrained tokenizer: 31999
- Number of common tokens: 19535
- Tokens unique to base tokenizer: 10992
  Tokens unique to retrained tokenizer: 12464
-
- Example common tokens:
  `['##cts', 'accounted', '##rik', 'fairness', 'music', 'dragons', 'manga', 'vermont', 'matters', '##iting']`
 
- Example tokens unique to base:
  `['federer', 'caucasian', 'remade', '##დ', '[unused134]', 'downfall', 'sahib', '[unused225]', '##ngo', '[unused684]', 'scared', '##gated', 'grinned', 'slick', 'bahn', '##〉', '##reus', 'ufo', 'gathers', 'bayern']`
 
- Example tokens unique to retrained:
  `['odot', '##dx', 'mathscr', '##517', 'matplotlib', 'cruc', 'tlie', '##osl', 'qg', 'oc', 'sach', '##colsep', '479', 'conclud', 'iniqu', '##ahan', 'pn', 'foref', 'rapidity', 'faraday']`
 
 
  where `bert-base-uncased`'s tokenizer is the base tokenizer:
 
+ Total tokens in base tokenizer: 30527
+ Total tokens in retrained tokenizer: 31999
+ Number of common tokens: 19535
+ Tokens unique to base tokenizer: 10992
  Tokens unique to retrained tokenizer: 12464
+
+ Example common tokens:
  `['##cts', 'accounted', '##rik', 'fairness', 'music', 'dragons', 'manga', 'vermont', 'matters', '##iting']`
 
+ Example tokens unique to base:
  `['federer', 'caucasian', 'remade', '##დ', '[unused134]', 'downfall', 'sahib', '[unused225]', '##ngo', '[unused684]', 'scared', '##gated', 'grinned', 'slick', 'bahn', '##〉', '##reus', 'ufo', 'gathers', 'bayern']`
 
+ Example tokens unique to retrained:
  `['odot', '##dx', 'mathscr', '##517', 'matplotlib', 'cruc', 'tlie', '##osl', 'qg', 'oc', 'sach', '##colsep', '479', 'conclud', 'iniqu', '##ahan', 'pn', 'foref', 'rapidity', 'faraday']`
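
The counts in this hunk are what plain set operations on two vocabularies would give: the reported figures satisfy 19535 + 10992 = 30527 and 19535 + 12464 = 31999. The README does not say how they were actually computed, so the following is only a minimal sketch of one way to reproduce such a comparison. The function name `compare_vocabs`, its parameters, and the dictionary keys are hypothetical, introduced here for illustration.

```python
import random


def compare_vocabs(base_tokens, retrained_tokens, n_examples=10, seed=0):
    """Summarize the overlap between two tokenizer vocabularies.

    Both arguments are iterables of token strings. This helper is a
    hypothetical illustration, not the script used for the README.
    """
    base, retrained = set(base_tokens), set(retrained_tokens)
    common = base & retrained          # tokens present in both vocabularies
    only_base = base - retrained       # tokens the retrained vocabulary lacks
    only_retrained = retrained - base  # tokens new in the retrained vocabulary

    rng = random.Random(seed)

    def sample(tokens):
        # Sort before sampling so a given seed always yields the same examples.
        ordered = sorted(tokens)
        return rng.sample(ordered, min(n_examples, len(ordered)))

    return {
        "total_base": len(base),
        "total_retrained": len(retrained),
        "common": len(common),
        "unique_to_base": len(only_base),
        "unique_to_retrained": len(only_retrained),
        "example_common": sample(common),
        "example_unique_to_base": sample(only_base),
        "example_unique_to_retrained": sample(only_retrained),
    }


if __name__ == "__main__":
    # Toy vocabularies stand in for the real ones.
    stats = compare_vocabs(
        ["music", "##cts", "ufo", "[unused1]"],
        ["music", "##cts", "matplotlib", "##dx", "faraday"],
    )
    for key, value in stats.items():
        print(f"{key}: {value}")
```

With the `transformers` library, the keys of `AutoTokenizer.from_pretrained(name).get_vocab()` could be passed as either argument. Whether that matches the counting used for the figures above is an assumption.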