wordpiece-tokenizer-32k-en_code-orig
Standard WordPiece tokenizer trained on the expository prose dataset. It follows the "original" WordPiece logic (as in the BERT tokenizer) and does not tokenize or otherwise account for whitespace: all whitespace is normalized to a single space, and the pre-tokenizer then splits on whitespace.
This exists for comparison to BEE-spoke-data/wordpiece-tokenizer-32k-en_code-msp
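For reference, a setup along these lines can be expressed with the Hugging Face `tokenizers` library. This is a minimal sketch of the normalize-then-split-on-whitespace behaviour described above, not the exact training script; the corpus below is a placeholder for the expository prose dataset, and lowercasing is assumed only to mirror bert-base-uncased.

```python
from tokenizers import Tokenizer, Regex, models, normalizers, pre_tokenizers, trainers

# WordPiece model with the usual BERT-style unknown token
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# Collapse every run of whitespace to a single space (whitespace is not preserved),
# then lowercase (assumed here, to match bert-base-uncased)
tokenizer.normalizer = normalizers.Sequence([
    normalizers.Replace(Regex(r"\s+"), " "),
    normalizers.Lowercase(),
])

# Split purely on whitespace before applying WordPiece
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

trainer = trainers.WordPieceTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# placeholder corpus standing in for the actual training data
corpus_iterator = ["an example sentence of expository prose", "def example(): pass"]
tokenizer.train_from_iterator(corpus_iterator, trainer)
```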
Comparison vs. the BERT/MPNet tokenizer
bert-base-uncased's tokenizer is the 'base tokenizer' in the comparison below.
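Counts of this kind can be computed with a straightforward vocabulary comparison; a minimal sketch follows (not necessarily the exact script used here), assuming this tokenizer is published as BEE-spoke-data/wordpiece-tokenizer-32k-en_code-orig.

```python
import random
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("bert-base-uncased")
retrained = AutoTokenizer.from_pretrained(
    "BEE-spoke-data/wordpiece-tokenizer-32k-en_code-orig"  # assumed repo id for this tokenizer
)

base_vocab = set(base.get_vocab())            # token strings only; ids are irrelevant here
retrained_vocab = set(retrained.get_vocab())

print("Total tokens in base tokenizer:", len(base_vocab))
print("Total tokens in retrained tokenizer:", len(retrained_vocab))
print("Number of common tokens:", len(base_vocab & retrained_vocab))
print("Tokens unique to base tokenizer:", len(base_vocab - retrained_vocab))
print("Tokens unique to retrained tokenizer:", len(retrained_vocab - base_vocab))

# draw a handful of example tokens from each set
print("Example common tokens:", random.sample(sorted(base_vocab & retrained_vocab), 10))
print("Example tokens unique to base:", random.sample(sorted(base_vocab - retrained_vocab), 10))
print("Example tokens unique to retrained:", random.sample(sorted(retrained_vocab - base_vocab), 10))
```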
Total tokens in base tokenizer: 30527
Total tokens in retrained tokenizer: 31999
Number of common tokens: 19535
Tokens unique to base tokenizer: 10992
Tokens unique to retrained tokenizer: 12464
Example common tokens: ['##cts', 'accounted', '##rik', 'fairness', 'music', 'dragons', 'manga', 'vermont', 'matters', '##iting']
Example tokens unique to base: ['federer', 'caucasian', 'remade', '##დ', '[unused134]', 'downfall', 'sahib', '[unused225]', '##ngo', '[unused684]', 'scared', '##gated', 'grinned', 'slick', 'bahn', '##〉', '##reus', 'ufo', 'gathers', 'bayern']
Example tokens unique to retrained: ['odot', '##dx', 'mathscr', '##517', 'matplotlib', 'cruc', 'tlie', '##osl', 'qg', 'oc', 'sach', '##colsep', '479', 'conclud', 'iniqu', '##ahan', 'pn', 'foref', 'rapidity', 'faraday']