JonusNattapong committed
Commit 16fe072 · verified · 1 Parent(s): 2076fd5

Update README.md

Files changed (1): README.md (+189, -184)
---
language: th
license: apache-2.0
tags:
- thai
- tokenizer
- nlp
- subword
model_type: unigram
library_name: tokenizers
pretty_name: Advanced Thai Tokenizer V3
datasets:
- ZombitX64/Thai-corpus-word
metrics:
- accuracy
- character
---

# Advanced Thai Tokenizer V3

## Overview
An advanced Thai-language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. It handles Thai, mixed Thai-English, numbers, and modern vocabulary, and is designed for LLM/NLP use with robust roundtrip accuracy and no byte-level artifacts.

## Performance
- **Overall Accuracy:** 24/24 test cases (100.0%)
- **Vocabulary Size:** 35,590 tokens
- **Average Compression:** 3.45 chars/token
- **UNK Ratio:** 0%
- **Thai Character Coverage:** 100%
- **Tested on:** Real-world, mixed, and edge-case sentences
- **Training Corpus:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain)

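The numbers above come from the author's test suite, which is not reproduced here. As a rough illustration of how such figures can be checked, the sketch below measures roundtrip accuracy, UNK ratio, and compression; the sample sentences are illustrative, not the actual 24-case suite:

```python
# Hedged sketch: measure roundtrip accuracy, UNK ratio, and compression.
# The test sentences are illustrative, not the author's actual test suite.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
tests = ["สวัสดีครับ", "AI กับการพัฒนา software ในปี 2025", "ราคา 1,500 บาท"]

passed = unk = total_tokens = total_chars = 0
for text in tests:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    unk += tokens.count(tokenizer.unk_token)
    total_tokens += len(tokens)
    total_chars += len(text)
    passed += tokenizer.decode(ids, skip_special_tokens=True) == text

print(f"Roundtrip: {passed}/{len(tests)}")
print(f"UNK ratio: {unk / total_tokens:.2%}")
print(f"Compression: {total_chars / total_tokens:.2f} chars/token")
```
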
## Key Features
- ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
- ✅ Handles mixed Thai-English, numbers, and symbols
- ✅ Modern vocabulary (internet, technology, social, business)
- ✅ Efficient compression (subword, not word-level)
- ✅ Clean decoding without artifacts
- ✅ HuggingFace-compatible (tokenizer.json, vocab.json, config)
- ✅ Production-ready: tested, documented, and robust

## Quick Start
```python
from transformers import AutoTokenizer

# Load the tokenizer from the HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")
    # Encode, then decode, to verify the lossless roundtrip
    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding["input_ids"], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
```

## Files
- `tokenizer.json` — Main tokenizer file (HuggingFace format)
- `vocab.json` — Vocabulary mapping
- `tokenizer_config.json` — Transformers config
- `metadata.json` — Performance and configuration details
- `usage_examples.json` — Code examples
- `README.md` — This file
- `combined_thai_corpus.txt` — Training corpus (not included in this repo; see the dataset card)

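As a quick sanity check, `tokenizer.json` can be inspected directly. A minimal sketch, assuming the file has been downloaded locally and follows the standard HuggingFace Tokenizers serialization:

```python
# Hedged sketch: inspect the serialized tokenizer; assumes tokenizer.json
# is in the working directory and uses the standard Tokenizers layout.
import json

with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

print(data["model"]["type"])        # expected: "Unigram"
print(len(data["model"]["vocab"]))  # expected: ~35,590 entries
```
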
Created: July 2025

---

# Model Card for Advanced Thai Tokenizer V3

## Model Details

- **Developed by:** ZombitX64 (https://huggingface.co/ZombitX64)
- **Model type:** Unigram (subword) tokenizer
- **Language(s):** th (Thai), mixed Thai-English
- **License:** Apache-2.0
- **Finetuned from model:** N/A (trained from scratch)

### Model Sources
- **Repository:** https://huggingface.co/ZombitX64/Thaitokenizer

## Uses

### Direct Use
- Tokenization for Thai LLMs, NLP, and downstream tasks
- Preprocessing for text classification, NER, QA, summarization, etc.
- Robust for mixed Thai-English, numbers, and social content

### Downstream Use
- Plug into HuggingFace Transformers pipelines
- Use as the tokenizer for Thai LLM pretraining/fine-tuning
- Integrate with spaCy, PyThaiNLP, or custom pipelines (see the sketch below)

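For custom pipelines that do not use Transformers, the raw `tokenizers` library can load `tokenizer.json` directly. A minimal sketch, assuming `huggingface_hub` is installed and using the repo and file names from this card:

```python
# Hedged sketch: load the tokenizer with the raw tokenizers library,
# bypassing Transformers entirely.
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

path = hf_hub_download("ZombitX64/Thaitokenizer", "tokenizer.json")
tok = Tokenizer.from_file(path)

enc = tok.encode("ภาษาไทยสนุกมาก")  # illustrative sentence
print(enc.tokens)
print(tok.decode(enc.ids))
```
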
### Out-of-Scope Use
- Not a language model (no text generation by itself)
- Not suitable for non-Thai-centric tasks

## Bias, Risks, and Limitations

- Trained on public Thai web/corpus data; may reflect real-world bias
- Not guaranteed to cover rare dialects, slang, or OCR errors
- No explicit filtering for toxic or biased content in the corpus
- The tokenizer does not understand context or meaning (no disambiguation)

### Recommendations

- For best results, use with LLMs or models trained on a similar corpus
- For sensitive or critical applications, review the corpus and test thoroughly
- For word-level tasks (NER, POS), pair the tokenizer with context-aware models

## How to Get Started with the Model

The Quick Start section above shows loading, tokenizing, and roundtrip decoding with `AutoTokenizer`. A common next step is batch preprocessing, sketched below.

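A minimal sketch of batch encoding for downstream preprocessing; the sentences are illustrative:

```python
# Hedged sketch: batch-encode several sentences at once.
# Example sentences are illustrative, not from the author's test suite.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
batch = ["สวัสดีตอนเช้า", "กาแฟร้อนหนึ่งแก้ว", "Thai NLP ปี 2025"]

encodings = tokenizer(batch, add_special_tokens=False)
for text, ids in zip(batch, encodings["input_ids"]):
    print(text, "->", tokenizer.convert_ids_to_tokens(ids))
```
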
## Training Details

### Training Data
- **Source:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain Thai text)
- **Size:** 71.7M
- **Preprocessing:** Deduplication, encoding cleanup, and minimal text cleaning; no Unicode normalization and no byte fallback

### Training Procedure
- **Tokenizer:** HuggingFace Tokenizers (Unigram)
- **Vocab size:** 35,590
- **Special tokens:** `<unk>`
- **Pre-tokenizer:** Punctuation only
- **No normalizer, no post-processor, no decoder**
- **Training regime:** CPU, Python 3.11, single run; see the training script for details

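The training script itself is not included in this repo. A minimal sketch of a Unigram setup matching the settings above (corpus filename and vocab size as reported; everything else is an assumption, not the author's actual script):

```python
# Hedged sketch of a Unigram training setup matching this card's settings.
# Not the author's actual script; details beyond those listed are assumed.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# No normalizer, post-processor, or decoder is attached, per the card.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Punctuation()  # punctuation-only pre-tokenization

trainer = trainers.UnigramTrainer(
    vocab_size=35590,
    special_tokens=["<unk>"],
    unk_token="<unk>",
)
tokenizer.train(["combined_thai_corpus.txt"], trainer)
tokenizer.save("tokenizer.json")
```
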
### Speeds, Sizes, Times
- **Training time:** Not recorded
- **Checkpoint size:** `tokenizer.json`, size not recorded

## Evaluation

### Testing Data, Factors & Metrics
- **Testing data:** Real-world Thai sentences, mixed content, edge cases
- **Metrics:** Roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio
- **Results:** 100% roundtrip, 0% UNK, 100% Thai character coverage, 3.45 chars/token

## Environmental Impact

- Trained on CPU with low energy usage
- No large-scale GPU/TPU compute required

## Technical Specifications

- **Model architecture:** Unigram (subword) tokenizer
- **Software:** tokenizers >= 0.15, Python 3.11
- **Hardware:** Standard CPU (no GPU required)

## Citation

If you use this tokenizer, please cite:

```bibtex
@misc{zombitx64_thaitokenizer_v3_2025,
  author       = {ZombitX64},
  title        = {Advanced Thai Tokenizer V3},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
}
```

## Model Card Authors

- ZombitX64 (https://huggingface.co/ZombitX64)

## Model Card Contact

For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.