Advanced Thai Tokenizer V3

Overview

Advanced Thai language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. Handles Thai, mixed Thai-English, numbers, and modern vocabulary. Designed for LLM/NLP use, with robust roundtrip accuracy and no byte-level artifacts.

Performance

  • Overall Accuracy: 24/24 roundtrip tests passed (100.0%)
  • Vocabulary Size: 35,590 tokens
  • Average Compression: 3.45 chars/token
  • UNK Ratio: 0%
  • Thai Character Coverage: 100%
  • Tested on: real-world, mixed, and edge-case sentences (a measurement sketch follows this list)
  • Training Corpus: combined_thai_corpus.txt (cleaned, deduplicated, multi-domain)
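
The figures above come from the V3 evaluation run. They can be re-checked on your own sentences with a short script; a minimal sketch, assuming the tokenizer loads from the Hub as in the Quick Start below, with placeholder test sentences:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
samples = ["นั่งตากลม", "ราคา 1,500 บาท", "AI กับภาษาไทย"]  # placeholder test set; use your own sentences

roundtrip_ok = 0
total_chars = 0
total_tokens = 0
unk_count = 0
unk_token = tokenizer.unk_token  # may be None depending on the saved config

for text in samples:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    roundtrip_ok += int(decoded == text)                     # roundtrip accuracy
    total_chars += len(text)
    total_tokens += len(ids)
    unk_count += sum(1 for t in tokens if t == unk_token)    # UNK ratio numerator

print(f"Roundtrip: {roundtrip_ok}/{len(samples)}")
print(f"Compression: {total_chars / total_tokens:.2f} chars/token")
print(f"UNK ratio: {unk_count / total_tokens:.1%}")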

Key Features

  • ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
  • ✅ Handles mixed Thai-English, numbers, and symbols
  • ✅ Modern vocabulary (internet, technology, social, business)
  • ✅ Efficient compression (subword, not word-level)
  • ✅ Clean decoding without artifacts
  • ✅ HuggingFace-compatible (tokenizer.json, vocab.json, config)
  • ✅ Production-ready: tested, documented, and robust

Quick Start

from transformers import AutoTokenizer

# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")
    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")

Files

  • tokenizer.json — Main tokenizer file (HuggingFace format); can also be loaded directly with the tokenizers library (see the sketch after this list)
  • vocab.json — Vocabulary mapping
  • tokenizer_config.json — Transformers config
  • metadata.json — Performance and configuration details
  • usage_examples.json — Code examples
  • README.md — This file
  • combined_thai_corpus.txt — Training corpus (not included in repo, see dataset card)
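
Outside of transformers, the tokenizer.json listed above can be loaded directly with the tokenizers library. A minimal sketch, assuming the file has been downloaded to the working directory (the example sentence is illustrative):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # path to the downloaded file
enc = tok.encode("ภาษาไทยกับ AI ในปี 2025")
print(enc.tokens)   # subword tokens
print(enc.ids)      # vocabulary ids
print(tok.decode(enc.ids))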

Created: July 2025


Model Card for Advanced Thai Tokenizer V3

Model Details

Developed by: ZombitX64 (https://huggingface.co/ZombitX64)
Model type: Unigram (subword) tokenizer
Language(s): th (Thai), mixed Thai-English
License: Apache-2.0
Finetuned from model: N/A (trained from scratch)

Model Sources

  • Repository: https://huggingface.co/ZombitX64/Thaitokenizer

Uses

Direct Use

  • Tokenization for Thai LLMs, NLP, and downstream tasks
  • Preprocessing for text classification, NER, QA, summarization, etc.
  • Robust for mixed Thai-English, numbers, and social content

Downstream Use

  • Plug into HuggingFace Transformers pipelines
  • Use as tokenizer for Thai LLM pretraining/fine-tuning (see the preprocessing sketch after this list)
  • Integrate with spaCy, PyThaiNLP, or custom pipelines
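
For the fine-tuning path, a hedged sketch of pairing the tokenizer with a datasets preprocessing step; the corpus file name here is a placeholder, not part of this release:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
dataset = load_dataset("text", data_files={"train": "my_thai_corpus.txt"})  # placeholder corpus file

def tokenize_batch(batch):
    # Truncate long lines; adjust max_length to your model's context size
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize_batch, batched=True, remove_columns=["text"])
print(tokenized["train"][0]["input_ids"][:20])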

Out-of-Scope Use

  • Not a language model (no text generation by itself)
  • Not suitable for non-Thai-centric tasks

Bias, Risks, and Limitations

  • Trained on public Thai web/corpus data; may reflect real-world bias
  • Not guaranteed to cover rare dialects, slang, or OCR errors
  • No explicit filtering for toxic/biased content in corpus
  • Tokenizer does not understand context/meaning (no disambiguation)

Recommendations

  • For best results, use with LLMs or models trained on a similar corpus
  • For sensitive/critical applications, review corpus and test thoroughly
  • For word-level tasks, use with context-aware models (NER, POS); see the offset-mapping sketch below
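
For word-level tasks, the fast tokenizer's offset mapping lets you project subword tokens back onto character spans before aligning labels. A minimal sketch with an illustrative sentence (label alignment itself is task-specific and omitted):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
text = "บริษัทเปิดตัว AI ใหม่ในปี 2025"  # illustrative sentence
enc = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    print(f"{token!r:>15} -> {text[start:end]!r}")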

How to Get Started with the Model

The loading and roundtrip example shown in the Quick Start section above applies here unchanged.

Training Details

Training Data

  • Source: combined_thai_corpus.txt (cleaned, deduplicated, multi-domain Thai text)
  • Size: 71.7M
  • Preprocessing: deduplication, encoding cleanup, and minimal filtering; no Unicode normalization and no byte-level fallback in the tokenizer itself (a rough cleaning sketch follows this list)
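
The corpus preparation described above boils down to line-level deduplication plus encoding cleanup. A rough sketch of such a step (the actual cleaning script is not part of this repo; the raw input file name is a placeholder):

seen = set()
with open("raw_thai_text.txt", encoding="utf-8", errors="ignore") as src, \
     open("combined_thai_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if line and line not in seen:  # drop blank lines and exact duplicates
            seen.add(line)
            dst.write(line + "\n")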

Training Procedure

  • Tokenizer: HuggingFace Tokenizers (Unigram)
  • Vocab size: 35,590
  • Special tokens:
  • Pre-tokenizer: Punctuation only
  • No normalization, no post-processor, no decoder
  • Training regime: CPU, Python 3.11, single run; see the training script for details (a hedged configuration sketch follows this list)
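
A hedged sketch of a Unigram training setup matching the settings listed above (vocab size 35,590, punctuation-only pre-tokenization, no normalizer, no byte fallback); the actual training script may differ in details such as special tokens:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())                  # empty Unigram model, no byte fallback
tokenizer.pre_tokenizer = pre_tokenizers.Punctuation()   # punctuation-only pre-tokenization
# No normalizer, post-processor, or decoder is attached, as described above.

trainer = trainers.UnigramTrainer(
    vocab_size=35590,
    special_tokens=["<unk>"],  # placeholder; see metadata.json for the actual special tokens
    unk_token="<unk>",
)
tokenizer.train(files=["combined_thai_corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")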

Speeds, Sizes, Times

  • Training time: -
  • Checkpoint size: tokenizer.json ~[size] KB

Evaluation

Testing Data, Factors & Metrics

  • Testing data: Real-world Thai sentences, mixed content, edge cases
  • Metrics: Roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio
  • Results: 100% roundtrip, 0% UNK, 100% Thai char coverage, 3.45 chars/token

Environmental Impact

  • Training on CPU, low energy usage
  • No large-scale GPU/TPU compute required

Technical Specifications

  • Model architecture: Unigram (subword) tokenizer
  • Software: tokenizers >= 0.15, Python 3.11
  • Hardware: Standard CPU (no GPU required)

Citation

If you use this tokenizer, please cite:

@misc{zombitx64_thaitokenizer_v3_2025,
  author = {ZombitX64},
  title = {Advanced Thai Tokenizer V3},
  year = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
}

Model Card Authors

ZombitX64

Model Card Contact

For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.
