🇰🇭 Khmer BPE-MD-v3-SPM (Hybrid Tokenizer)

BPE-MD-v3-SPM is a hybrid Khmer tokenizer that combines:

BPE-MD (Morphology-Driven) rules for Khmer word segmentation, and
SentencePiece BPE modeling for subword learning, coverage, and byte safety.

This tokenizer is built for both Khmer and mixed-script (Khmer + English + Math) text. It handles Unicode normalization, symbols, and numerals, which makes it suited to LLMs, translation models, and RAG systems.
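This card does not state which Unicode normalization form the tokenizer applies. As a standard-library illustration of why the choice matters for the math and chemical symbols this card lists (such as ² and ₁₀), the sketch below compares NFC and NFKC on a sample string. The helper name `compare_forms` and the sample text are hypothetical.

```python
import unicodedata

def compare_forms(text):
    """Return the NFC and NFKC forms of text for side-by-side comparison."""
    return {
        "NFC": unicodedata.normalize("NFC", text),
        "NFKC": unicodedata.normalize("NFKC", text),
    }

# Hypothetical sample containing superscript and subscript digits.
sample = "√25 + 3² = 14, log₁₀, H₂O"
forms = compare_forms(sample)

print(forms["NFC"])   # superscripts and subscripts are kept as-is
print(forms["NFKC"])  # compatibility folding turns ² into 2 and ₁₀ into 10
```

Under NFKC, `3²` becomes `32` and the exponent is lost, so a pipeline that wants to keep such symbols as distinct tokens has to either avoid compatibility folding or protect them before normalizing.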

🧠 Features

Hybrid design: BPE-MD (morphology) × SentencePiece (subword)
Script coverage: Khmer + Latin + Math + Digits
Vocab size: 16,100
Character coverage: 1.0
Includes: user-defined math and chemical tokens (√, ², ₁₀, H₂O, log₁₀, etc.)
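To make the script-coverage point concrete, here is a minimal sketch that splits mixed text into runs of Khmer, Latin, and digit characters before any subword model sees it. It is not the BPE-MD morphology rule set, which this card does not describe. The regular expression, the character classes, and the name `split_by_script` are assumptions chosen for illustration.

```python
import re

# Assumed script classes for illustration; these are not the card's BPE-MD rules.
SCRIPT_RUNS = re.compile(
    r"[\u1780-\u17FF]+"   # run of Khmer-block characters (includes Khmer digits)
    r"|[A-Za-z]+"         # run of basic Latin letters
    r"|[0-9]+"            # run of ASCII digits
    r"|\S"                # any other non-space character, one at a time
)

def split_by_script(text):
    """Split text into same-script runs; whitespace is dropped."""
    return SCRIPT_RUNS.findall(text)

print(split_by_script("ខ្ញុំ √25 + 3² = 14"))
print(split_by_script("H₂O"))
```

A plain script split breaks `H₂O` into `H`, `₂`, and `O`, which suggests why registering such strings as user-defined tokens can be useful.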

🧩 Example usage

    from transformers import T5Tokenizer

    tok = T5Tokenizer.from_pretrained("Msok99/km-bpe-md-v3-spm")
    text = "ខ្ញុំបានគណនាថា √25 + 3² = 14"

    # Print the subword pieces, then the text recovered by decoding the encoded IDs.
    print(tok.tokenize(text))
    print(tok.decode(tok.encode(text)))

📊 Training details

Base: Khmer Morphology-Driven corpus (education, news, QA)
Algorithm: SentencePiece (BPE mode)
User symbols: Mathematical, scientific, and Khmer-digit patterns
Goal: Robust tokenization for LLM fine-tuning on Khmer + mixed-script data
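The settings listed in this card map onto SentencePiece trainer options roughly as sketched below. This is a reconstruction under stated assumptions, not the author's training script. The corpus path and model prefix are hypothetical, the symbol list contains only the examples the card names, the vocabulary size reads "16 100" as 16100, and `byte_fallback=True` is a guess at what "byte safety" refers to.

```python
def build_trainer_args(corpus_path, model_prefix):
    """Collect keyword arguments for SentencePieceTrainer.train()."""
    return dict(
        input=corpus_path,            # hypothetical path to pre-segmented text
        model_prefix=model_prefix,    # hypothetical output prefix
        model_type="bpe",             # "SentencePiece (BPE mode)" per the card
        vocab_size=16100,             # card value, read as 16100
        character_coverage=1.0,       # card value
        # Only the symbols the card names; the full list is not given.
        user_defined_symbols=["√", "²", "₁₀", "H₂O", "log₁₀"],
        byte_fallback=True,           # assumption: the card's "byte safety"
    )

def train(corpus_path, model_prefix):
    """Run training; requires the third-party sentencepiece package."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))

args = build_trainer_args("corpus.txt", "km_bpe_md")
print(args["vocab_size"], args["model_type"])
```

Calling `train("corpus.txt", "km_bpe_md")` would write `km_bpe_md.model` and `km_bpe_md.vocab` next to the script, provided the corpus file exists and is large enough to support the requested vocabulary size.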

📜 License

MIT
