Update README.md
README.md
CHANGED
@@ -1,185 +1,190 @@
---
language: th
license: apache-2.0
tags:
- thai
- tokenizer
- nlp
- subword
model_type: unigram
library_name: tokenizers
pretty_name: Advanced Thai Tokenizer V3
datasets:
- ZombitX64/Thai-corpus-word
metrics:
- accuracy
- character
---

# Advanced Thai Tokenizer V3

## Overview
Advanced Thai language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. Handles Thai, mixed Thai-English, numbers, and modern vocabulary. Designed for LLM/NLP use, with robust roundtrip accuracy and no byte-level artifacts.

## Performance
- **Overall Accuracy:** 24/24 (100.0%)
- **Vocabulary Size:** 35,590 tokens
- **Average Compression:** 3.45 chars/token
- **UNK Ratio:** 0%
- **Thai Character Coverage:** 100%
- **Tested on:** Real-world, mixed, and edge-case sentences
- **Training Corpus:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain)

## Key Features
- ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
- ✅ Handles mixed Thai-English, numbers, and symbols
- ✅ Modern vocabulary (internet, technology, social, business)
- ✅ Efficient compression (subword, not word-level)
- ✅ Clean decoding without artifacts
- ✅ HuggingFace-compatible (tokenizer.json, vocab.json, config)
- ✅ Production-ready: tested, documented, and robust

## Quick Start
```python
from transformers import AutoTokenizer

# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")
    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
```

## Files
- `tokenizer.json` — Main tokenizer file (HuggingFace format)
- `vocab.json` — Vocabulary mapping
- `tokenizer_config.json` — Transformers config
- `metadata.json` — Performance and configuration details
- `usage_examples.json` — Code examples
- `README.md` — This file
- `combined_thai_corpus.txt` — Training corpus (not included in repo, see dataset card)
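
Because `tokenizer.json` is a standard HuggingFace Tokenizers file, it can also be used without the `transformers` wrapper. A minimal sketch, assuming the file has been downloaded locally (the path here is illustrative):

```python
from tokenizers import Tokenizer

# Load the raw tokenizer file directly (local path is an assumption,
# e.g. obtained via huggingface_hub.hf_hub_download)
tok = Tokenizer.from_file("tokenizer.json")

enc = tok.encode("นั่งตาก ลม")
print(enc.tokens)           # subword strings
print(enc.ids)              # vocabulary ids
print(tok.decode(enc.ids))  # roundtrip back to text
```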

Created: July 2025

---

# Model Card for Advanced Thai Tokenizer V3

## Model Details

- **Developed by:** ZombitX64 (https://huggingface.co/ZombitX64)
- **Model type:** Unigram (subword) tokenizer
- **Language(s):** th (Thai), mixed Thai-English
- **License:** Apache-2.0
- **Finetuned from model:** N/A (trained from scratch)

### Model Sources
- **Repository:** https://huggingface.co/ZombitX64/Thaitokenizer

## Uses

### Direct Use
- Tokenization for Thai LLMs, NLP, and downstream tasks
- Preprocessing for text classification, NER, QA, summarization, etc.
- Robust for mixed Thai-English, numbers, and social content

### Downstream Use
- Plug into HuggingFace Transformers pipelines
- Use as the tokenizer for Thai LLM pretraining/fine-tuning (see the sketch below)
- Integrate with spaCy, PyThaiNLP, or custom pipelines
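
As an example of the pretraining/fine-tuning path, the tokenizer can be applied over a corpus with 🤗 Datasets' `map`. This is a sketch only; the data file and column names are placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")

# Placeholder corpus file; any text dataset with a "text" column works
ds = load_dataset("text", data_files={"train": "thai_corpus.txt"})

def tokenize_batch(batch):
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = ds["train"].map(tokenize_batch, batched=True, remove_columns=["text"])
print(tokenized[0]["input_ids"][:10])
```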

### Out-of-Scope Use
- Not a language model (no text generation by itself)
- Not suitable for non-Thai-centric tasks

## Bias, Risks, and Limitations

- Trained on public Thai web/corpus data; may reflect real-world bias
- Not guaranteed to cover rare dialects, slang, or OCR errors
- No explicit filtering for toxic/biased content in the corpus
- The tokenizer does not understand context or meaning (no disambiguation)

### Recommendations

- For best results, use with LLMs or models trained on a similar corpus
- For sensitive or critical applications, review the corpus and test thoroughly
- For word-level tasks, use with context-aware models (NER, POS); see the offset sketch below
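
For those word-level pipelines, a fast tokenizer's character offsets let you align subword tokens back to the original string. A small sketch, assuming the checkpoint loads as a fast tokenizer (repos shipping `tokenizer.json` normally do):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")

text = "นั่งตาก ลม"
enc = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)

# Each token's (start, end) character span in the original text
for tok_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
    print(tokenizer.convert_ids_to_tokens(tok_id), repr(text[start:end]))
```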

## How to Get Started with the Model

Load the tokenizer from the Hub exactly as in the Quick Start section above; the same snippet (tokenize, then encode/decode roundtrip) applies unchanged.

## Training Details

### Training Data
- **Source:** `combined_thai_corpus.txt` (cleaned, deduplicated, multi-domain Thai text)
- **Size:** 71.7 MB
- **Preprocessing:** duplicate removal, encoding cleanup, minimal cleaning; no tokenizer-level normalization, no byte-level fallback

### Training Procedure
- **Tokenizer:** HuggingFace Tokenizers (Unigram)
- **Vocab size:** 35,590
- **Special tokens:** `<unk>`
- **Pre-tokenizer:** punctuation only
- **Pipeline:** no normalizer, no post-processor, no decoder
- **Training regime:** CPU, Python 3.11, single run; see the training script for details (a configuration sketch follows below)
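
The training script itself is not included here, but a configuration along these lines reproduces the setup described above (a sketch: the corpus filename is the one named under Training Data, the other arguments are the stated settings):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Unigram model with punctuation-only pre-tokenization;
# no normalizer, post-processor, or decoder is attached
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Punctuation()

trainer = trainers.UnigramTrainer(
    vocab_size=35590,
    special_tokens=["<unk>"],
    unk_token="<unk>",
)
tokenizer.train(["combined_thai_corpus.txt"], trainer)
tokenizer.save("tokenizer.json")
```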

### Speeds, Sizes, Times
- **Training time:** -
- **Checkpoint size:** tokenizer.json ~[size] KB

## Evaluation

### Testing Data, Factors & Metrics
- **Testing data:** real-world Thai sentences, mixed content, edge cases
- **Metrics:** roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio
- **Results:** 100% roundtrip accuracy, 0% UNK, 100% Thai character coverage, 3.45 chars/token
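
These numbers are straightforward to recompute. A sketch of the roundtrip, UNK-ratio, and compression checks (the test sentences below are placeholders, not the actual 24-case suite):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")

# Placeholder test set; the reported results used 24 curated sentences
tests = ["สวัสดีครับ", "Thai กับ English ปนกัน 123"]

unk_id = tokenizer.convert_tokens_to_ids("<unk>")
for text in tests:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    print("roundtrip:", decoded == text,
          "unk_ratio:", ids.count(unk_id) / len(ids),
          "chars/token:", round(len(text) / len(ids), 2))
```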

## Environmental Impact

- Trained on CPU with low energy usage
- No large-scale GPU/TPU compute required

## Technical Specifications

- **Model architecture:** Unigram (subword) tokenizer
- **Software:** tokenizers >= 0.15, Python 3.11
- **Hardware:** standard CPU (no GPU required)

## Citation

If you use this tokenizer, please cite:

```bibtex
@misc{zombitx64_thaitokenizer_v3_2025,
  author       = {ZombitX64},
  title        = {Advanced Thai Tokenizer V3},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
}
```

## Model Card Authors

- ZombitX64 (https://huggingface.co/ZombitX64)

## Model Card Contact

For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.