---
language:
- en
- es
- fr
- de
library_name: tokenizers
license: cc-by-4.0
tags:
- kl3m
- kl3m-005
- alea
- legal
- financial
- multi-word
date: '2025-03-15T00:00:00.000Z'
---

# kl3m-005-multi-word-example-32k tokenizer

The `kl3m-005-multi-word-example-32k` tokenizer is an experimental domain-specific tokenizer that introduces **multi-word token learning** by using random whitespace pre-tokenization during training. This allows the tokenizer to learn complete multi-word expressions as single tokens, improving compression and semantic retention for domain-specific terminology.

This tokenizer was trained on a stratified sample of nearly 4M documents across general, legal, and financial domains from the `kl3m-data` project, including American English, British English, Spanish, German, French, Italian, and other common EU languages.

## Model Details

### Summary

- **Vocabulary:** 32,768
- **Tokenizer type:** BPE with multi-word capability
- **Special token support:** Both causal and masked language modeling
- **Language(s) (NLP):** Primarily English, Spanish, German, and French, with a small percentage of other EU languages
- **Data Sources:** See the [`kl3m-data`](https://github.com/alea-institute/kl3m-data) repository
- **Developed by:** [ALEA Institute](https://aleainstitute.ai)
- **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

### Model Description

The `kl3m-005-multi-word-example-32k` tokenizer introduces a novel technique for multi-word token learning that avoids the complexity of previous multi-word tokenization approaches. Instead of post-processing or complex token-merging strategies, this tokenizer uses specialized pre-tokenization during training that randomly decides whether or not to split on whitespace.

This tokenizer is notable for a number of reasons:

#### Multi-Word Token Learning

The key innovation in this tokenizer is the implementation of random whitespace pre-tokenization during training. This technique:

- Uses the `RandomWhitespaceSplit` pre-tokenizer, which probabilistically decides whether to split on whitespace
- Enables learning of multi-word units as single tokens (e.g., "of the", "in the", "United States")
- Improves compression and semantic coherence for common multi-word expressions
- Doesn't require complex hyperparameter transitions or multi-phase training

This implementation is based on the new pre-tokenizers added to the Hugging Face `tokenizers` library that enable multi-word token learning. For more information, see [Hugging Face PR #1753](https://github.com/huggingface/tokenizers/pull/1753).
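To make the idea concrete, below is a minimal sketch of random whitespace pre-tokenization wired into a BPE training run. It does **not** use the actual `RandomWhitespaceSplit` pre-tokenizer from PR #1753; it approximates the same behavior with a custom Python pre-tokenizer, and the class name `RandomWhitespacePreTokenizer`, the `split_probability` value, and the trainer settings are illustrative assumptions rather than the configuration used for this tokenizer.

```python
import random
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString, Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import PreTokenizer
from tokenizers.trainers import BpeTrainer


class RandomWhitespacePreTokenizer:
    """Illustrative stand-in for random whitespace splitting (not the PR #1753 code).

    Each whitespace boundary is cut with probability `split_probability`; boundaries
    left uncut keep their neighboring words joined, so the BPE trainer can learn
    merges that span spaces (e.g. "of the", "United States").
    """

    def __init__(self, split_probability: float = 0.75):
        self.split_probability = split_probability

    def _random_split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        text = str(normalized)
        pieces: List[NormalizedString] = []
        start = 0
        for match in re.finditer(r"\s+", text):
            if random.random() < self.split_probability:
                # Cut here: emit the span before the whitespace and drop the whitespace.
                pieces.append(normalized[start:match.start()])
                start = match.end()
            # Otherwise the whitespace stays inside the current span (multi-word unit).
        pieces.append(normalized[start:len(text)])
        return [p for p in pieces if len(str(p)) > 0]

    def pre_tokenize(self, pretok: PreTokenizedString):
        # PreTokenizedString.split applies `_random_split` to each current piece.
        pretok.split(self._random_split)


# Assemble a BPE tokenizer that uses the random-split pre-tokenizer during training.
tokenizer = Tokenizer(BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = PreTokenizer.custom(RandomWhitespacePreTokenizer(split_probability=0.75))

trainer = BpeTrainer(
    vocab_size=32768,
    special_tokens=["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|sep|>", "<|cls|>", "<|mask|>"],
)

# `corpus_iterator` is a placeholder for an iterator over training documents.
# tokenizer.train_from_iterator(corpus_iterator, trainer=trainer)
```

Because each whitespace boundary is cut only with some probability, frequent word pairs sometimes reach the BPE trainer as single pre-tokens and can be merged into multi-word vocabulary entries. Note that custom Python pre-tokenizers like the one above typically cannot be serialized with the trained tokenizer, which is one practical reason to prefer a native pre-tokenizer such as the one added in PR #1753.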
#### Domain Specific

As with previous KL3M tokenizers, this tokenizer was trained on a large corpus of financial and legal text. It has not seen common general pretraining sources like Wikipedia or Common Crawl, making it highly specialized for its target domains.

#### Large Added Token Set

Similar to other KL3M tokenizers, we included a large number of deterministic "whole" tokens in the vocabulary, including common HTML tags and other markup elements that occur frequently in legal and financial documents.

The tokenizer also defines the standard KL3M special tokens, covering both causal and masked language modeling as well as chat and instruction formatting:

* `<|start|>`: `0`
* `<|end|>`: `1`
* `<|pad|>`: `2`
* `<|unk|>`: `3`
* `<|sep|>`: `4`
* `<|cls|>`: `5`
* `<|mask|>`: `6`
* `<|system|>`: `7`
* `</|system|>`: `8`
* `<|user|>`: `9`
* `</|user|>`: `10`
* `<|instruction|>`: `11`
* `</|instruction|>`: `12`

### Examples

Here's an example of how this tokenizer produces different token sequences compared to standard tokenizers:

```text
Original text: The Supreme Court of the United States has ruled that free speech is protected under the First Amendment.

Standard BPE tokenization:
["The", " Supreme", " Court", " of", " the", " United", " States", " has", " ruled", " that", " free", " speech", " is", " protected", " under", " the", " First", " Amendment", "."]

kl3m-005-multi-word-example-32k:
["The", " Supreme Court", " of the", " United States", " has", " ruled", " that", " free speech", " is", " protected", " under the", " First Amendment", "."]
```

Notice how the multi-word tokenizer captures complete phrases like "Supreme Court", "of the", "United States", "free speech", and "First Amendment" as single tokens, improving compression and preserving semantic units.

### Replication

The entire data collection and preprocessing pipeline is being made available as part of the [ALEA Institute](https://aleainstitute.ai) [KL3M project](https://aleainstitute.ai/work/kl3m/).

The source code used to train the tokenizer is available on GitHub at [https://github.com/alea-institute/kl3m-tokenizers](https://github.com/alea-institute/kl3m-tokenizers).

## Uses

This tokenizer is intended for English, Spanish, German, or French text in professional contexts such as legal and financial documents. It is particularly useful for applications where preserving multi-word expressions is important for semantic understanding and generation.

### Recommendations

The `kl3m-005-multi-word-example-32k` tokenizer is recommended for:

- Legal or financial document processing where multi-word terms are common
- Applications where token compression is critical
- Research into multi-word tokenization approaches
- Tasks requiring better semantic coherence in tokenization

For more traditional tokenization, consider `kl3m-004-128k-cased` or other KL3M tokenizers.

## How to Get Started with the Model

Use the code below to get started with the model:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-005-multi-word-example-32k')

# Example showing multi-word tokens
text = "The Supreme Court of the United States has ruled that free speech is protected under the First Amendment."
encoded = tokenizer.encode(text)
tokens = encoded.tokens

print(f"Token count: {len(tokens)}")
print("Tokens:", tokens)
```

## Citation

Tokenizer and dataset publications are pending.

## Contact

For any questions, please contact the [ALEA Institute](https://aleainstitute.ai) at [hello@aleainstitute.ai](mailto:hello@aleainstitute.ai) or create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-tokenizers).

![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)