---
language:
- en
- es
- fr
- de
library_name: tokenizers
license: cc-by-4.0
tags:
- kl3m
- kl3m-005
- alea
- legal
- financial
- multi-word
date: '2025-03-15T00:00:00.000Z'
---

# kl3m-005-multi-word-example-32k tokenizer

The `kl3m-005-multi-word-example-32k` tokenizer is an experimental domain-specific tokenizer that introduces **multi-word token learning** by using random whitespace pre-tokenization during training. This allows the tokenizer to learn complete multi-word expressions as single tokens, improving compression and semantic retention for domain-specific terminology.

This tokenizer was trained on a stratified sample of nearly 4M documents across general, legal, and financial domains from the `kl3m-data` project, including American English, British English, Spanish, German, French, Italian, and other common EU languages.

## Model Details

### Summary

- **Vocabulary:** 32,768
- **Tokenizer type:** BPE with multi-word capability
- **Special token support:** Both causal and masked language modeling
- **Language(s) (NLP):** Primarily English, Spanish, German, and French, with a small percentage of other EU languages
- **Data sources:** See the [`kl3m-data`](https://github.com/alea-institute/kl3m-data) repository
- **Developed by:** [ALEA Institute](https://aleainstitute.ai)
- **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

### Model Description

The `kl3m-005-multi-word-example-32k` tokenizer introduces a novel technique for multi-word token learning that avoids the complexity of previous multi-word tokenization approaches. Instead of post-processing or complex token-merging strategies, this tokenizer uses a specialized pre-tokenization step during training that randomly decides whether or not to split on whitespace.

This tokenizer is notable for a number of reasons:

#### Multi-Word Token Learning

The key innovation in this tokenizer is the use of random whitespace pre-tokenization during training. This technique:

- Uses the `RandomWhitespaceSplit` pre-tokenizer, which probabilistically decides whether to split on whitespace
- Enables learning of multi-word units as single tokens (e.g., "of the", "in the", "United States")
- Improves compression and semantic coherence for common multi-word expressions
- Does not require complex hyperparameter transitions or multi-phase training

This implementation builds on the new pre-tokenizers added to the Hugging Face `tokenizers` library that enable multi-word token learning; for more information, see [Hugging Face PR #1753](https://github.com/huggingface/tokenizers/pull/1753). A conceptual sketch of the approach is shown below.

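The actual splitting logic lives in the `tokenizers` library (see the PR linked above); the sketch below is only a minimal plain-Python illustration of the idea, with an assumed `split_probability` value, showing how randomly skipping whitespace splits produces multi-word pre-tokens that a BPE trainer can then learn as single vocabulary entries.

```python
import random
import re


def random_whitespace_pretokenize(text: str, split_probability: float = 0.7) -> list[str]:
    """Illustrative only: split at whitespace with probability `split_probability`.

    When a boundary is *not* split, the neighboring words stay together in one
    pre-token, which is what lets strings like "Supreme Court" or "of the"
    enter the learned BPE vocabulary as single tokens.
    """
    pieces = re.split(r"(\s+)", text)  # keep the whitespace pieces in the list
    pretokens: list[str] = []
    current = ""
    for piece in pieces:
        if piece.isspace():
            if random.random() < split_probability:
                # Split here: close the current pre-token and start the next
                # one with a leading space (GPT-style word boundary).
                if current:
                    pretokens.append(current)
                current = piece
            else:
                # Keep the space, so the words on either side stay together.
                current += piece
        else:
            current += piece
    if current:
        pretokens.append(current)
    return pretokens


random.seed(0)
print(random_whitespace_pretokenize("The Supreme Court of the United States"))
```

During training, pre-tokens like these are what allow multi-word strings to enter the learned vocabulary.
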
#### Domain Specific

As with previous KL3M tokenizers, this tokenizer was trained on a large corpus of financial and legal text. It has not seen common general pretraining sources such as Wikipedia or Common Crawl, making it highly specialized for its target domains.

#### Large Added Token Set

Similar to other KL3M tokenizers, we included a large number of deterministic "whole" tokens in the vocabulary (a quick way to verify them is shown after the list):

- HTML tags like `<span`
- Common Markdown elements like `#` and `##`
- Legal enumerations like `(a)`
- Academic and legal citations

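Whether a given string actually maps to a single token in the released vocabulary can be checked with the standard `tokenizers` API; the strings below are drawn from the list above and are illustrative rather than exhaustive.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

# If a string was added as a "whole" token, it should encode to a single piece,
# and token_to_id should return a non-None id for it.
for text in ("<span", "##", "(a)"):
    encoding = tokenizer.encode(text, add_special_tokens=False)
    print(f"{text!r}: tokens={encoding.tokens}, id={tokenizer.token_to_id(text)}")
```
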
#### Special Tokens

For training and inference efficiency, we included special tokens suitable for both causal and masked language modeling tasks (a short usage sketch follows the list):

* `<|start|>`: `0`
* `<|end|>`: `1`
* `<|pad|>`: `2`
* `<|unk|>`: `3`
* `<|sep|>`: `4`
* `<|cls|>`: `5`
* `<|mask|>`: `6`
* `<|system|>`: `7`
* `</|system|>`: `8`
* `<|user|>`: `9`
* `</|user|>`: `10`
* `<|instruction|>`: `11`
* `</|instruction|>`: `12`

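The role markers above suggest a chat-style layout, but the actual prompt template is defined by whichever KL3M model consumes this tokenizer; the framing in the snippet below is a hypothetical example chosen for this card, not an official format.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

# Look up the ids of a few special tokens directly.
for token in ("<|start|>", "<|system|>", "<|user|>", "<|mask|>"):
    print(token, "->", tokenizer.token_to_id(token))

# Hypothetical chat-style framing; a real KL3M model may expect a different template.
prompt = (
    "<|start|>"
    "<|system|>You are a legal drafting assistant.</|system|>"
    "<|user|>Summarize the First Amendment.</|user|>"
)
print(tokenizer.encode(prompt).tokens)
```
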
### Examples

Here is an example of how this tokenizer produces different token sequences compared to a standard BPE tokenizer:

```text
Original text: The Supreme Court of the United States has ruled that free speech is protected under the First Amendment.

Standard BPE tokenization:
["The", " Supreme", " Court", " of", " the", " United", " States", " has", " ruled", " that", " free", " speech", " is", " protected", " under", " the", " First", " Amendment", "."]

kl3m-005-multi-word-example-32k:
["The", " Supreme Court", " of the", " United States", " has", " ruled", " that", " free speech", " is", " protected", " under the", " First Amendment", "."]
```

Notice how the multi-word tokenizer captures complete phrases like "Supreme Court", "of the", "United States", "free speech", and "First Amendment" as single tokens, improving compression and preserving semantic units. A similar comparison can be reproduced with the snippet below.

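To reproduce a comparison along these lines, encode the same sentence with this tokenizer and any generic BPE baseline; `gpt2` is used below purely as a convenient, publicly available reference point and is not part of the KL3M release.

```python
from tokenizers import Tokenizer

text = (
    "The Supreme Court of the United States has ruled that free speech "
    "is protected under the First Amendment."
)

# Compare token counts for the multi-word tokenizer and a generic BPE baseline.
for name in ("alea-institute/kl3m-005-multi-word-example-32k", "gpt2"):
    tokenizer = Tokenizer.from_pretrained(name)
    encoding = tokenizer.encode(text, add_special_tokens=False)
    print(f"{name}: {len(encoding.tokens)} tokens")
    print(encoding.tokens)
```
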
### Replication

The entire data collection and preprocessing pipeline is being made available as part of the [ALEA Institute](https://aleainstitute.ai) [KL3M project](https://aleainstitute.ai/work/kl3m/).

The source code used to train the tokenizer is available on GitHub at:
[https://github.com/alea-institute/kl3m-tokenizers](https://github.com/alea-institute/kl3m-tokenizers)

## Uses

This tokenizer is intended for English, Spanish, German, and French text in professional contexts such as legal and financial documents. It is particularly useful for applications where preserving multi-word expressions is important for semantic understanding and generation.

### Recommendations

The `kl3m-005-multi-word-example-32k` tokenizer is recommended for:

- Legal or financial document processing where multi-word terms are common
- Applications where token compression is critical
- Research into multi-word tokenization approaches
- Tasks requiring better semantic coherence in tokenization

For more traditional tokenization, consider `kl3m-004-128k-cased` or another KL3M tokenizer.

## How to Get Started with the Model

Use the code below to get started with the tokenizer:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

# Example showing multi-word tokens
text = "The Supreme Court of the United States has ruled that free speech is protected under the First Amendment."
encoded = tokenizer.encode(text)
tokens = encoded.tokens

print(f"Token count: {len(tokens)}")
print("Tokens:", tokens)
```
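Decoding uses the same `tokenizers` API; assuming the released tokenizer ships with a decoder configuration, a simple round trip looks like this:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

encoded = tokenizer.encode("The Securities and Exchange Commission issued a final rule.")
# Round-trip: decode the ids back to text, dropping any special markers.
print(tokenizer.decode(encoded.ids, skip_special_tokens=True))
```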

## Citation

Tokenizer and dataset publications are pending.

## Contact

For any questions, please contact the [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]), or create an issue on this repository or on [GitHub](https://github.com/alea-institute/kl3m-tokenizers).

![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)