smol-er GPT-NeoX Tokenizer
This tokenizer contains 32,023 tokens, so models that use it should round the vocabulary size up to a multiple of 64 or 128 (i.e., 32,064 or 32,128): CUDA matrix kernels run noticeably faster when the embedding dimension is padded to such multiples.
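As a minimal sketch of that rounding, `pad_vocab_size` below is a hypothetical helper, not part of any library:

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the nearest multiple, for GPU-friendly embedding shapes."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab_size(32023, 64))   # -> 32064
print(pad_vocab_size(32023, 128))  # -> 32128
```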
Compared to Original GPT-NeoX
The 'base tokenizer' in the comparison below is the tokenizer from EleutherAI/pythia-1b.
Total tokens in base tokenizer: 50,277
Total tokens in retrained tokenizer: 32,023
Number of common tokens: 28,219
Tokens unique to base tokenizer: 22,058
Tokens unique to retrained tokenizer: 3,804
Example common tokens: ['ounter', 'ĠRaymond', 'ĠIP', 'Ġcontroversy', '=', 'Ġsituations', 'Ġclimbed', 'Ġtrac', 'XY', 'Ġhave']
Example tokens unique to base: ['ĠShift', 'åħĥ', 'dling', 'ĠUL', 'ďă', 'ooter', 'Ġrandomization', 'Ġsprite', 'iab', 'notice', 'ilate', 'æĻĤ', 'Ġoutdoors', 'Ġ--------------------------', 'organization', 'Ġcitrate', 'ĠUber', 'Ġabandonment', 'Ġacquittal', 'Ġrestraint']
Example tokens unique to retrained: ['ittee', 'Ġibn', 'ĠInteg', 'decay', 'stability', '|}{', 'Ġhia', 'egu', 'ĊĠĉ', '}{$', '}&=', 'Ġ1837', '--(', "'}{", 'Ġgoalkeeper', 'rossover', 'cious', 'unsupervised', 'cmid', 'ĊĠĠĠĠĠĠĠĠĠĠĊĠĠĠĠĠĠĠĠĠĠĠ']
(In these byte-level BPE vocabularies, Ġ marks a leading space, Ċ a newline, and ĉ a tab.)
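For reference, here is one way these counts and samples could be reproduced with the transformers library; "path/to/this-tokenizer" is a placeholder for wherever this retrained tokenizer is hosted:

```python
import random
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
retrained = AutoTokenizer.from_pretrained("path/to/this-tokenizer")  # placeholder path

# get_vocab() returns a token -> id dict; we only need the token strings.
base_vocab = set(base.get_vocab())
retrained_vocab = set(retrained.get_vocab())
common = base_vocab & retrained_vocab

print("Total tokens in base tokenizer:", len(base_vocab))
print("Total tokens in retrained tokenizer:", len(retrained_vocab))
print("Number of common tokens:", len(common))
print("Tokens unique to base tokenizer:", len(base_vocab - retrained_vocab))
print("Tokens unique to retrained tokenizer:", len(retrained_vocab - base_vocab))

# Sample a few tokens from each set, as in the examples above.
print("Example common tokens:", random.sample(sorted(common), 10))
print("Example tokens unique to base:", random.sample(sorted(base_vocab - retrained_vocab), 20))
print("Example tokens unique to retrained:", random.sample(sorted(retrained_vocab - base_vocab), 20))
```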