---
language:
- en
- es
- fr
- de
library_name: tokenizers
license: cc-by-4.0
tags:
- kl3m
- kl3m-005
- alea
- legal
- financial
- multi-word
date: '2025-03-15T00:00:00.000Z'
---

# kl3m-005-multi-word-example-32k tokenizer

The `kl3m-005-multi-word-example-32k` tokenizer is an experimental domain-specific tokenizer that introduces **multi-word token learning** by using random whitespace pre-tokenization during training. This allows the tokenizer to learn complete multi-word expressions as single tokens, improving compression and semantic retention for domain-specific terminology.

This tokenizer was trained on a stratified sample of nearly 4M documents across general, legal, and financial domains from the `kl3m-data` project, covering American English, British English, Spanish, German, French, Italian, and other common EU languages.

## Model Details

### Summary

- **Vocabulary:** 32,768
- **Tokenizer type:** BPE with multi-word capability
- **Special token support:** Both causal and masked language modeling
- **Language(s) (NLP):** Primarily English, Spanish, German, and French, with a small percentage of other EU languages.
- **Data Sources:** See the [`kl3m-data`](https://github.com/alea-institute/kl3m-data) repository.
- **Developed by:** [ALEA Institute](https://aleainstitute.ai).
- **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

### Model Description

The `kl3m-005-multi-word-example-32k` tokenizer introduces a novel technique for multi-word token learning that avoids the complexity of previous multi-word tokenization approaches. Instead of post-processing or complex token merging strategies, this tokenizer uses specialized pre-tokenization during training that randomly decides whether to split on whitespace or not.

This tokenizer is notable for a number of reasons:

#### Multi-Word Token Learning

The key innovation in this tokenizer is the implementation of random whitespace pre-tokenization during training. This technique:

- Uses the `RandomWhitespaceSplit` pre-tokenizer, which probabilistically decides whether to split on whitespace
- Enables learning of multi-word units as single tokens (e.g., "of the", "in the", "United States")
- Improves compression and semantic coherence for common multi-word expressions
- Doesn't require complex hyperparameter transitions or multi-phase training

This implementation is based on the new pre-tokenizers added to the Hugging Face `tokenizers` library that enable multi-word token learning. For more information, see [Hugging Face PR #1753](https://github.com/huggingface/tokenizers/pull/1753).

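To make the idea concrete, here is a minimal sketch of random whitespace pre-tokenization in plain Python. It is an illustration only: it does not use the actual `RandomWhitespaceSplit` implementation from the `tokenizers` library, and the function name and `merge_probability` parameter are assumptions chosen for this example.

```python
import random

def random_whitespace_pretokenize(text: str, merge_probability: float = 0.3) -> list[str]:
    """Illustrative sketch: split on whitespace, but with probability
    `merge_probability` keep the space and merge the word into the previous
    chunk, so BPE training can learn multi-word units such as "of the"."""
    words = text.split(" ")
    chunks = [words[0]]
    for word in words[1:]:
        if random.random() < merge_probability:
            # Keep the whitespace boundary inside the current chunk.
            chunks[-1] += " " + word
        else:
            chunks.append(" " + word)
    return chunks

# Repeated runs produce different segmentations, so the trainer sees both
# single words and multi-word spans for the same text.
print(random_whitespace_pretokenize(
    "The Supreme Court of the United States has ruled"))
```

Because segmentation is randomized per document, frequent word pairs still appear often enough as joined chunks for BPE to merge them into single vocabulary entries.
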
#### Domain Specific

As with previous KL3M tokenizers, this tokenizer was trained on a large corpus of financial and legal text. It has not seen any common general pretraining sources such as Wikipedia or Common Crawl, making it highly specialized for its target domains.

#### Large Added Token Set

Similar to other KL3M tokenizers, we included a large number of deterministic "whole" tokens in the vocabulary:

- HTML tags like `<span`
- Common Markdown elements like `#` and `##`
- Legal enumerations like `(a)`
- Academic and legal citations

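As a quick check, the snippet below encodes a few of the strings from the list above and prints the resulting tokens. It assumes the tokenizer loads as in the quick-start example later in this README; whether each string maps to exactly one token depends on the final vocabulary.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

# Inspect how a few of the deterministic "whole" tokens listed above are encoded.
for sample in ["<span", "##", "(a)"]:
    encoding = tokenizer.encode(sample)
    print(f"{sample!r} -> {encoding.tokens}")
```
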
#### Special Tokens

For both training and inference efficiency, we included special tokens suitable for both causal and masked language modeling tasks:

* `<|start|>`: `0`
* `<|end|>`: `1`
* `<|pad|>`: `2`
* `<|unk|>`: `3`
* `<|sep|>`: `4`
* `<|cls|>`: `5`
* `<|mask|>`: `6`
* `<|system|>`: `7`
* `</|system|>`: `8`
* `<|user|>`: `9`
* `</|user|>`: `10`
* `<|instruction|>`: `11`
* `</|instruction|>`: `12`

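This README does not specify the exact prompt template used during KL3M training, so the following is only a hypothetical illustration of how the role-delimiter tokens above could be used to frame a system/user exchange.

```python
# Hypothetical prompt formatting; the actual KL3M template may differ.
def format_chat(system: str, user: str) -> str:
    return (
        "<|start|>"
        f"<|system|>{system}</|system|>"
        f"<|user|>{user}</|user|>"
    )

prompt = format_chat(
    system="You are a legal drafting assistant.",
    user="Summarize the protections of the First Amendment.",
)
print(prompt)
```
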
### Examples

Here's an example of how this tokenizer produces different token sequences compared to standard tokenizers:

```text
Original text: The Supreme Court of the United States has ruled that free speech is protected under the First Amendment.

Standard BPE tokenization:
["The", " Supreme", " Court", " of", " the", " United", " States", " has", " ruled", " that", " free", " speech", " is", " protected", " under", " the", " First", " Amendment", "."]

kl3m-005-multi-word-example-32k:
["The", " Supreme Court", " of the", " United States", " has", " ruled", " that", " free speech", " is", " protected", " under the", " First Amendment", "."]
```

Notice how the multi-word tokenizer captures complete phrases like "Supreme Court", "of the", "United States", "free speech", and "First Amendment" as single tokens, improving compression and preserving semantic units.

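To reproduce a comparison like the one above, one can count tokens with this tokenizer against a generic baseline. The choice of `gpt2` as the baseline below is an assumption made for illustration, and the exact counts will depend on the tokenizers involved.

```python
from tokenizers import Tokenizer

text = (
    "The Supreme Court of the United States has ruled that "
    "free speech is protected under the First Amendment."
)

kl3m = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")
baseline = Tokenizer.from_pretrained("gpt2")  # illustrative baseline choice

# Fewer tokens for the same text indicates better compression.
print("kl3m-005 tokens:", len(kl3m.encode(text).tokens))
print("baseline tokens:", len(baseline.encode(text).tokens))
```
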
### Replication

The entire data collection and preprocessing pipeline is being made available as part of the [ALEA Institute](https://aleainstitute.ai) [KL3M project](https://aleainstitute.ai/work/kl3m/).

The source code used to train the tokenizer is available on GitHub at:
[https://github.com/alea-institute/kl3m-tokenizers](https://github.com/alea-institute/kl3m-tokenizers)

## Uses

This tokenizer is intended for English, Spanish, German, or French text in professional contexts such as legal and financial documents. It is particularly useful for applications where preserving multi-word expressions is important for semantic understanding and generation.

### Recommendations

The `kl3m-005-multi-word-example-32k` tokenizer is recommended for:

- Legal or financial document processing where multi-word terms are common
- Applications where token compression is critical
- Research into multi-word tokenization approaches
- Tasks requiring better semantic coherence in tokenization

For more traditional tokenization, consider `kl3m-004-128k-cased` or another KL3M tokenizer.

## How to Get Started with the Model

Use the code below to get started with the model:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-005-multi-word-example-32k')

# Example showing multi-word tokens
text = "The Supreme Court of the United States has ruled that free speech is protected under the First Amendment."
encoded = tokenizer.encode(text)
tokens = encoded.tokens

print(f"Token count: {len(tokens)}")
print("Tokens:", tokens)
```

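As a follow-up, the token IDs and a round-trip decode can be checked the same way; this sketch uses only the standard `Tokenizer.encode` and `Tokenizer.decode` methods and repeats the loading step so it runs on its own.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-005-multi-word-example-32k')
encoded = tokenizer.encode("The Supreme Court of the United States has ruled.")

# Inspect token IDs and decode back to text.
print("IDs:", encoded.ids)
print("Round trip:", tokenizer.decode(encoded.ids))
```
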
## Citation

Tokenizer and dataset publications are pending.

## Contact

For any questions, please contact the [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or create an issue on this repository or on [GitHub](https://github.com/alea-institute/kl3m-tokenizers).

![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)