---
language:
- en
- es
- fr
- de
library_name: tokenizers
license: cc-by-4.0
tags:
- kl3m
- kl3m-005
- alea
- legal
- financial
- multi-word
date: '2025-03-15T00:00:00.000Z'
---

# kl3m-005-multi-word-example-32k tokenizer

The `kl3m-005-multi-word-example-32k` tokenizer is an experimental domain-specific tokenizer that introduces **multi-word token learning** by using random whitespace pre-tokenization during training. This allows the tokenizer to learn complete multi-word expressions as single tokens, improving compression and semantic retention for domain-specific terminology.

This tokenizer was trained on a stratified sample of nearly 4M documents across general, legal, and financial domains from the `kl3m-data` project, including American English, British English, Spanish, German, French, Italian, and other common EU languages.

## Model Details

### Summary

- **Vocabulary:** 32,768
- **Tokenizer type:** BPE with multi-word capability
- **Special token support:** Both causal and masked language modeling
- **Language(s) (NLP):** Primarily English, Spanish, German, and French, with a small percentage of other EU languages
- **Data sources:** See the [`kl3m-data`](https://github.com/alea-institute/kl3m-data) repository
- **Developed by:** [ALEA Institute](https://aleainstitute.ai)
- **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

### Model Description

The `kl3m-005-multi-word-example-32k` tokenizer introduces a novel technique for multi-word token learning that avoids the complexity of previous multi-word tokenization approaches. Instead of post-processing or complex token-merging strategies, this tokenizer uses a specialized pre-tokenization step during training that randomly decides whether or not to split on whitespace.

This tokenizer is notable for a number of reasons:

#### Multi-Word Token Learning

The key innovation in this tokenizer is the use of random whitespace pre-tokenization during training. This technique:

- Uses the `RandomWhitespaceSplit` pre-tokenizer, which probabilistically decides whether to split on whitespace
- Enables learning of multi-word units as single tokens (e.g., "of the", "in the", "United States")
- Improves compression and semantic coherence for common multi-word expressions
- Does not require complex hyperparameter transitions or multi-phase training

This implementation builds on the new pre-tokenizers added to the Hugging Face `tokenizers` library that enable multi-word token learning; for more information, see [Hugging Face PR #1753](https://github.com/huggingface/tokenizers/pull/1753). A conceptual sketch of the approach is shown below.

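The actual splitting logic lives in the `tokenizers` library (see the PR linked above); the sketch below is only a minimal plain-Python illustration of the idea, with an assumed `split_probability` value, showing how randomly skipping whitespace splits produces multi-word pre-tokens that a BPE trainer can then learn as single vocabulary entries.

```python
import random
import re


def random_whitespace_pretokenize(text: str, split_probability: float = 0.7) -> list[str]:
    """Illustrative only: split at whitespace with probability `split_probability`.

    When a boundary is *not* split, the neighboring words stay together in one
    pre-token, which is what lets strings like "Supreme Court" or "of the"
    enter the learned BPE vocabulary as single tokens.
    """
    pieces = re.split(r"(\s+)", text)  # keep the whitespace pieces in the list
    pretokens: list[str] = []
    current = ""
    for piece in pieces:
        if piece.isspace():
            if random.random() < split_probability:
                # Split here: close the current pre-token and start the next
                # one with a leading space (GPT-style word boundary).
                if current:
                    pretokens.append(current)
                current = piece
            else:
                # Keep the space, so the words on either side stay together.
                current += piece
        else:
            current += piece
    if current:
        pretokens.append(current)
    return pretokens


random.seed(0)
print(random_whitespace_pretokenize("The Supreme Court of the United States"))
```

During training, pre-tokens like these are what allow multi-word strings to enter the learned vocabulary.
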
#### Domain Specific

As with previous KL3M tokenizers, this tokenizer was trained on a large corpus of financial and legal text. It has not seen common general pretraining sources such as Wikipedia or Common Crawl, making it highly specialized for its target domains.

#### Large Added Token Set

Similar to other KL3M tokenizers, we included a large number of deterministic "whole" tokens in the vocabulary (a quick way to verify them is shown after the list):

- HTML tags like `<span`
- Common Markdown elements like `#` and `##`
- Legal enumerations like `(a)`
- Academic and legal citations

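Whether a given string actually maps to a single token in the released vocabulary can be checked with the standard `tokenizers` API; the strings below are drawn from the list above and are illustrative rather than exhaustive.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

# If a string was added as a "whole" token, it should encode to a single piece,
# and token_to_id should return a non-None id for it.
for text in ("<span", "##", "(a)"):
    encoding = tokenizer.encode(text, add_special_tokens=False)
    print(f"{text!r}: tokens={encoding.tokens}, id={tokenizer.token_to_id(text)}")
```
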
#### Special Tokens

For training and inference efficiency, we included special tokens suitable for both causal and masked language modeling tasks (a short usage sketch follows the list):

* `<|start|>`: `0`
* `<|end|>`: `1`
* `<|pad|>`: `2`
* `<|unk|>`: `3`
* `<|sep|>`: `4`
* `<|cls|>`: `5`
* `<|mask|>`: `6`
* `<|system|>`: `7`
* `</|system|>`: `8`
* `<|user|>`: `9`
* `</|user|>`: `10`
* `<|instruction|>`: `11`
* `</|instruction|>`: `12`

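The role markers above suggest a chat-style layout, but the actual prompt template is defined by whichever KL3M model consumes this tokenizer; the framing in the snippet below is a hypothetical example chosen for this card, not an official format.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

# Look up the ids of a few special tokens directly.
for token in ("<|start|>", "<|system|>", "<|user|>", "<|mask|>"):
    print(token, "->", tokenizer.token_to_id(token))

# Hypothetical chat-style framing; a real KL3M model may expect a different template.
prompt = (
    "<|start|>"
    "<|system|>You are a legal drafting assistant.</|system|>"
    "<|user|>Summarize the First Amendment.</|user|>"
)
print(tokenizer.encode(prompt).tokens)
```
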
### Examples

Here is an example of how this tokenizer produces different token sequences compared to a standard BPE tokenizer:

```text
Original text: The Supreme Court of the United States has ruled that free speech is protected under the First Amendment.

Standard BPE tokenization:
["The", " Supreme", " Court", " of", " the", " United", " States", " has", " ruled", " that", " free", " speech", " is", " protected", " under", " the", " First", " Amendment", "."]

kl3m-005-multi-word-example-32k:
["The", " Supreme Court", " of the", " United States", " has", " ruled", " that", " free speech", " is", " protected", " under the", " First Amendment", "."]
```

Notice how the multi-word tokenizer captures complete phrases like "Supreme Court", "of the", "United States", "free speech", and "First Amendment" as single tokens, improving compression and preserving semantic units. A similar comparison can be reproduced with the snippet below.

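To reproduce a comparison along these lines, encode the same sentence with this tokenizer and any generic BPE baseline; `gpt2` is used below purely as a convenient, publicly available reference point and is not part of the KL3M release.

```python
from tokenizers import Tokenizer

text = (
    "The Supreme Court of the United States has ruled that free speech "
    "is protected under the First Amendment."
)

# Compare token counts for the multi-word tokenizer and a generic BPE baseline.
for name in ("alea-institute/kl3m-005-multi-word-example-32k", "gpt2"):
    tokenizer = Tokenizer.from_pretrained(name)
    encoding = tokenizer.encode(text, add_special_tokens=False)
    print(f"{name}: {len(encoding.tokens)} tokens")
    print(encoding.tokens)
```
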
### Replication

The entire data collection and preprocessing pipeline is being made available as part of the [ALEA Institute](https://aleainstitute.ai) [KL3M project](https://aleainstitute.ai/work/kl3m/).

The source code used to train the tokenizer is available on GitHub at:
[https://github.com/alea-institute/kl3m-tokenizers](https://github.com/alea-institute/kl3m-tokenizers)

## Uses

This tokenizer is intended for English, Spanish, German, and French text in professional contexts such as legal and financial documents. It is particularly useful for applications where preserving multi-word expressions is important for semantic understanding and generation.

### Recommendations

The `kl3m-005-multi-word-example-32k` tokenizer is recommended for:

- Legal or financial document processing where multi-word terms are common
- Applications where token compression is critical
- Research into multi-word tokenization approaches
- Tasks requiring better semantic coherence in tokenization

For more traditional tokenization, consider `kl3m-004-128k-cased` or another KL3M tokenizer.

## How to Get Started with the Model

Use the code below to get started with the tokenizer:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

# Example showing multi-word tokens
text = "The Supreme Court of the United States has ruled that free speech is protected under the First Amendment."
encoded = tokenizer.encode(text)
tokens = encoded.tokens

print(f"Token count: {len(tokens)}")
print("Tokens:", tokens)
```
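Decoding uses the same `tokenizers` API; assuming the released tokenizer ships with a decoder configuration, a simple round trip looks like this:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

encoded = tokenizer.encode("The Securities and Exchange Commission issued a final rule.")
# Round-trip: decode the ids back to text, dropping any special markers.
print(tokenizer.decode(encoded.ids, skip_special_tokens=True))
```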

## Citation

Tokenizer and dataset publications are pending.

## Contact

For any questions, please contact the [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]), or create an issue on this repository or on [GitHub](https://github.com/alea-institute/kl3m-tokenizers).

![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)