Advanced Thai Tokenizer V3

Overview

Advanced Thai language tokenizer (Unigram, HuggingFace-compatible) trained on a large, cleaned, real-world Thai corpus. Handles Thai, mixed Thai-English, numbers, and modern vocabulary. Designed for LLM/NLP use, with robust roundtrip accuracy and no byte-level artifacts.

Performance

  • Overall Accuracy: 24/24 roundtrip tests passed (100.0%)
  • Vocabulary Size: 35,590 tokens
  • Average Compression: 3.45 chars/token
  • UNK Ratio: 0%
  • Thai Character Coverage: 100%
  • Tested on: real-world, mixed, and edge-case sentences (a measurement sketch follows this list)
  • Training Corpus: combined_thai_corpus.txt (cleaned, deduplicated, multi-domain)
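
The figures above come from the V3 evaluation run. They can be re-checked on your own sentences with a short script; a minimal sketch, assuming the tokenizer loads from the Hub as in the Quick Start below, with placeholder test sentences:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
samples = ["นั่งตากลม", "ราคา 1,500 บาท", "AI กับภาษาไทย"]  # placeholder test set; use your own sentences

roundtrip_ok = 0
total_chars = 0
total_tokens = 0
unk_count = 0
unk_token = tokenizer.unk_token  # may be None depending on the saved config

for text in samples:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    roundtrip_ok += int(decoded == text)                     # roundtrip accuracy
    total_chars += len(text)
    total_tokens += len(ids)
    unk_count += sum(1 for t in tokens if t == unk_token)    # UNK ratio numerator

print(f"Roundtrip: {roundtrip_ok}/{len(samples)}")
print(f"Compression: {total_chars / total_tokens:.2f} chars/token")
print(f"UNK ratio: {unk_count / total_tokens:.1%}")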

Key Features

  • ✅ No Thai character corruption (no byte-level fallback, no normalization loss)
  • ✅ Handles mixed Thai-English, numbers, and symbols
  • ✅ Modern vocabulary (internet, technology, social, business)
  • ✅ Efficient compression (subword, not word-level)
  • ✅ Clean decoding without artifacts
  • ✅ HuggingFace-compatible (tokenizer.json, vocab.json, config)
  • ✅ Production-ready: tested, documented, and robust

Quick Start

from transformers import AutoTokenizer

# Load tokenizer from HuggingFace Hub
try:
    tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
    text = "นั่งตาก ลม"
    tokens = tokenizer.tokenize(text)
    print(f"Tokens: {tokens}")
    encoding = tokenizer(text, return_tensors=None, add_special_tokens=False)
    decoded = tokenizer.decode(encoding['input_ids'], skip_special_tokens=True)
    print(f"Original: {text}")
    print(f"Decoded: {decoded}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")

Files

  • tokenizer.json — Main tokenizer file (HuggingFace format); can also be loaded directly with the tokenizers library (see the sketch after this list)
  • vocab.json — Vocabulary mapping
  • tokenizer_config.json — Transformers config
  • metadata.json — Performance and configuration details
  • usage_examples.json — Code examples
  • README.md — This file
  • combined_thai_corpus.txt — Training corpus (not included in repo, see dataset card)
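
Outside of transformers, the tokenizer.json listed above can be loaded directly with the tokenizers library. A minimal sketch, assuming the file has been downloaded to the working directory (the example sentence is illustrative):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # path to the downloaded file
enc = tok.encode("ภาษาไทยกับ AI ในปี 2025")
print(enc.tokens)   # subword tokens
print(enc.ids)      # vocabulary ids
print(tok.decode(enc.ids))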

Created: July 2025


Model Card for Advanced Thai Tokenizer V3

Model Details

Developed by: ZombitX64 (https://huggingface.co/ZombitX64)
Model type: Unigram (subword) tokenizer
Language(s): th (Thai), mixed Thai-English
License: Apache-2.0
Finetuned from model: N/A (trained from scratch)

Model Sources

  • Repository: https://huggingface.co/ZombitX64/Thaitokenizer

Uses

Direct Use

  • Tokenization for Thai LLMs, NLP, and downstream tasks
  • Preprocessing for text classification, NER, QA, summarization, etc.
  • Robust for mixed Thai-English, numbers, and social content

Downstream Use

  • Plug into HuggingFace Transformers pipelines
  • Use as tokenizer for Thai LLM pretraining/fine-tuning (see the preprocessing sketch after this list)
  • Integrate with spaCy, PyThaiNLP, or custom pipelines
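
For the fine-tuning path, a hedged sketch of pairing the tokenizer with a datasets preprocessing step; the corpus file name here is a placeholder, not part of this release:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
dataset = load_dataset("text", data_files={"train": "my_thai_corpus.txt"})  # placeholder corpus file

def tokenize_batch(batch):
    # Truncate long lines; adjust max_length to your model's context size
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize_batch, batched=True, remove_columns=["text"])
print(tokenized["train"][0]["input_ids"][:20])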

Out-of-Scope Use

  • Not a language model (no text generation by itself)
  • Not suitable for non-Thai-centric tasks

Bias, Risks, and Limitations

  • Trained on public Thai web/corpus data; may reflect real-world bias
  • Not guaranteed to cover rare dialects, slang, or OCR errors
  • No explicit filtering for toxic/biased content in corpus
  • Tokenizer does not understand context/meaning (no disambiguation)

Recommendations

  • For best results, use with LLMs or models trained on a similar corpus
  • For sensitive/critical applications, review corpus and test thoroughly
  • For word-level tasks, use with context-aware models (NER, POS); see the offset-mapping sketch below
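
For word-level tasks, the fast tokenizer's offset mapping lets you project subword tokens back onto character spans before aligning labels. A minimal sketch with an illustrative sentence (label alignment itself is task-specific and omitted):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Thaitokenizer")
text = "บริษัทเปิดตัว AI ใหม่ในปี 2025"  # illustrative sentence
enc = tokenizer(text, add_special_tokens=False, return_offsets_mapping=True)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    print(f"{token!r:>15} -> {text[start:end]!r}")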

How to Get Started with the Model

The loading and roundtrip example shown in the Quick Start section above applies here unchanged.

Training Details

Training Data

  • Source: combined_thai_corpus.txt (cleaned, deduplicated, multi-domain Thai text)
  • Size: 71.7M
  • Preprocessing: deduplication, encoding cleanup, and minimal filtering; no Unicode normalization and no byte-level fallback in the tokenizer itself (a rough cleaning sketch follows this list)
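
The corpus preparation described above boils down to line-level deduplication plus encoding cleanup. A rough sketch of such a step (the actual cleaning script is not part of this repo; the raw input file name is a placeholder):

seen = set()
with open("raw_thai_text.txt", encoding="utf-8", errors="ignore") as src, \
     open("combined_thai_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if line and line not in seen:  # drop blank lines and exact duplicates
            seen.add(line)
            dst.write(line + "\n")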

Training Procedure

  • Tokenizer: HuggingFace Tokenizers (Unigram)
  • Vocab size: 35,590
  • Special tokens:
  • Pre-tokenizer: Punctuation only
  • No normalization, no post-processor, no decoder
  • Training regime: CPU, Python 3.11, single run; see the training script for details (a hedged configuration sketch follows this list)
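
A hedged sketch of a Unigram training setup matching the settings listed above (vocab size 35,590, punctuation-only pre-tokenization, no normalizer, no byte fallback); the actual training script may differ in details such as special tokens:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())                  # empty Unigram model, no byte fallback
tokenizer.pre_tokenizer = pre_tokenizers.Punctuation()   # punctuation-only pre-tokenization
# No normalizer, post-processor, or decoder is attached, as described above.

trainer = trainers.UnigramTrainer(
    vocab_size=35590,
    special_tokens=["<unk>"],  # placeholder; see metadata.json for the actual special tokens
    unk_token="<unk>",
)
tokenizer.train(files=["combined_thai_corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")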

Speeds, Sizes, Times

  • Training time: -
  • Checkpoint size: tokenizer.json ~[size] KB

Evaluation

Testing Data, Factors & Metrics

  • Testing data: Real-world Thai sentences, mixed content, edge cases
  • Metrics: Roundtrip accuracy, UNK ratio, Thai character coverage, compression ratio
  • Results: 100% roundtrip, 0% UNK, 100% Thai char coverage, 3.45 chars/token

Environmental Impact

  • Training on CPU, low energy usage
  • No large-scale GPU/TPU compute required

Technical Specifications

  • Model architecture: Unigram (subword) tokenizer
  • Software: tokenizers >= 0.15, Python 3.11
  • Hardware: Standard CPU (no GPU required)

Citation

If you use this tokenizer, please cite:

@misc{zombitx64_thaitokenizer_v3_2025,
  author = {ZombitX64},
  title = {Advanced Thai Tokenizer V3},
  year = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Thaitokenizer}}
}

Model Card Authors

ZombitX64

Model Card Contact

For questions or feedback, open an issue on the HuggingFace repo or contact ZombitX64 via HuggingFace.
