wordpiece-tokenizer-32k-en_code-msp

A 'modern' uncased WordPiece tokenizer for masked language modeling (MLM), analogous to the bert-base-uncased tokenizer.

  • 32k vocab size, uncased. Trained with a max alphabet of 1000 and a min frequency of 5.
  • Unique WordPiece tokenizer that preserves whitespace information.
    • Done through a combination of the Metaspace() pre-tokenizer and custom grouping/filtering logic (see the sketch below).
  • Trained on English text and code via pints-ai/Expository-Prose-V1 to take advantage of the whitespace handling above.
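
For reference, below is a minimal sketch of how a tokenizer along these lines might be trained with the tokenizers library, assuming BERT-style normalizers and special tokens. The custom grouping/filtering logic and any settings beyond the vocab size, alphabet limit, and min frequency listed above are not documented here, so treat the details as illustrative rather than a faithful reproduction.

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# WordPiece model with the stated hyperparameters; normalizers and special tokens are assumed
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
# Metaspace marks word boundaries with "▁", so whitespace information survives pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.WordPieceTrainer(
    vocab_size=32000,
    min_frequency=5,
    limit_alphabet=1000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# In practice, stream text from pints-ai/Expository-Prose-V1 here
def corpus_iterator():
    yield "def add(a, b):\n    return a + b"
    yield "An example sentence of expository prose."

tokenizer.train_from_iterator(corpus_iterator(), trainer=trainer)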

Usage

from transformers import AutoTokenizer

repo_id = "BEE-spoke-data/wordpiece-tokenizer-32k-en_code-msp"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# same usage as any other tokenizer for encoder models
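
For a quick sanity check, you can look at the tokens produced for a snippet containing newlines and indentation and decode them back. The exact token strings depend on the released vocab, so this is just an illustration:

text = "def add(a, b):\n    return a + b"

enc = tokenizer(text)
# tokens are expected to carry the "▁" whitespace marker from the Metaspace pre-tokenizer
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(tokenizer.decode(enc["input_ids"], skip_special_tokens=True))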

Comparison vs bert-base-uncased

vocab

Code to run the comparison below:

import random

from transformers import AutoTokenizer

tk_base = AutoTokenizer.from_pretrained("bert-base-uncased")
tk_retrained = AutoTokenizer.from_pretrained("BEE-spoke-data/wordpiece-tokenizer-32k-en_code-msp")

# Get vocabularies as sets
vocab_base = set(tk_base.get_vocab().keys())
vocab_retrained = set(tk_retrained.get_vocab().keys())

# Compare vocabularies
common_tokens = vocab_base.intersection(vocab_retrained)
unique_to_base = vocab_base.difference(vocab_retrained)
unique_to_retrained = vocab_retrained.difference(vocab_base)

# Print results
print(f"Total tokens in base tokenizer: {len(vocab_base)}")
print(f"Total tokens in retrained tokenizer: {len(vocab_retrained)}")
print(f"Number of common tokens: {len(common_tokens)}")
print(f"Tokens unique to base tokenizer: {len(unique_to_base)}")
print(f"Tokens unique to retrained tokenizer: {len(unique_to_retrained)}")

# Optionally print a few examples
print("\nExamples of common tokens:", random.sample(list(common_tokens), k=10))
print("\nExamples of tokens unique to base:", random.sample(list(unique_to_base), k=20))
print(
    "\nExamples of tokens unique to retrained:",
    random.sample(list(unique_to_retrained), k=20)
)

Output:

Total tokens in base tokenizer: 30522
Total tokens in retrained tokenizer: 31999
Number of common tokens: 6719
Tokens unique to base tokenizer: 23803
Tokens unique to retrained tokenizer: 25280

Examples of common tokens: ['1908', 'ロ', 'pa', 'jiang', '##ibly', '1966', '##>', 'wind', '##ried', '天']

Examples of tokens unique to base: ['[unused686]', '[unused146]', 'jr', 'groves', 'janeiro', '氵', '[unused768]', 'abusive', 'illustrated', 'veteran', 'blitz', 'audio', 'lafayette', 'mice', 'pedersen', 'bharatiya', 'kerman', 'computed', 'broker', 'late']

Examples of tokens unique to retrained: ['454', '▁traveller', '▁peaked', '▁outflow', '##ributions', '##发', '▁more', '▁simon', '▁pok', '▁pounds', '▁ventric', '▁psychological', '455', '▁vi', '##bits', '##tex', '▁wing', '▁want', '▁cleans', '▁fac']

whitespace encoding

TODO: update for latest version
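
Until this section is updated, a quick way to compare whitespace handling is to tokenize some indented code with both tokenizers: bert-base-uncased discards whitespace during pre-tokenization, while this tokenizer keeps a "▁" marker for it (as seen in the unique-token examples above). The snippet below simply prints whatever each tokenizer produces; output is omitted here since it will change with the tokenizer version.

from transformers import AutoTokenizer

tk_base = AutoTokenizer.from_pretrained("bert-base-uncased")
tk_msp = AutoTokenizer.from_pretrained("BEE-spoke-data/wordpiece-tokenizer-32k-en_code-msp")

code = "def greet(name):\n    print(f'hi {name}')"

for name, tk in [("bert-base-uncased", tk_base), ("retrained", tk_msp)]:
    ids = tk(code)["input_ids"]
    print(name, tk.convert_ids_to_tokens(ids))
    print(name, repr(tk.decode(ids, skip_special_tokens=True)))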
