🇰🇭 Khmer BPE-MD-v3-SPM (Hybrid Tokenizer)
BPE-MD-v3-SPM is a hybrid Khmer tokenizer that combines:
- BPE-MD (Morphology-Driven) rules for Khmer word segmentation, and
- SentencePiece BPE modeling for subword learning, coverage, and byte safety.
This tokenizer is built for both Khmer and bilingual (Khmer + English + Math) text. It handles Unicode normalization, symbols, and numerals, making it suitable for LLMs, translation models, and RAG systems.
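The card does not say which Unicode normalization form is applied. As a hedged illustration of why the choice matters for math and chemistry symbols, the sketch below compares NFC and NFKC using Python's standard `unicodedata` module. The sample string is made up for this example.

```python
import unicodedata

# Hypothetical sample containing subscript and superscript characters.
sample = "H₂O, 3², log₁₀"

# NFC only composes canonically equivalent sequences; it leaves
# superscripts and subscripts untouched.
nfc = unicodedata.normalize("NFC", sample)

# NFKC also applies compatibility mappings, which fold superscript
# and subscript digits into plain ASCII digits.
nfkc = unicodedata.normalize("NFKC", sample)

print(nfc)   # H₂O, 3², log₁₀
print(nfkc)  # H2O, 32, log10
```

NFKC turns `3²` into `32`, so a pipeline that wants `²` or `H₂O` to survive as tokens has to use a form that preserves them or protect those symbols before normalizing.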
🧠 Features
- Hybrid design: BPE-MD (morphology) × SentencePiece (subword)
- Script coverage: Khmer + Latin + Math + Digits
- Vocab size: 16 100
- Character coverage: 1.0
- Includes: user-defined math and chemical tokens (√, ², H₂O, log₁₀, etc.)
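To see what "Khmer + Latin + Math + Digits" means for a concrete string, a small helper can split text into runs of the same coarse script class. This is only an inspection aid and not the BPE-MD rule set, which the card does not spell out. The function names and the bucket labels are assumptions made for this sketch.

```python
import itertools

def script_of(ch: str) -> str:
    """Classify one character into a coarse bucket (illustrative only)."""
    cp = ord(ch)
    # Khmer block (U+1780-U+17FF, includes Khmer digits) and Khmer Symbols.
    if 0x1780 <= cp <= 0x17FF or 0x19E0 <= cp <= 0x19FF:
        return "khmer"
    if ch.isascii() and ch.isalpha():
        return "latin"
    # isdecimal() is True for 0-9 but False for superscripts such as "²".
    if ch.isdecimal():
        return "digit"
    if ch.isspace():
        return "space"
    return "symbol"

def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters of the same bucket, dropping whitespace."""
    runs = []
    for kind, group in itertools.groupby(text, key=script_of):
        if kind != "space":
            runs.append((kind, "".join(group)))
    return runs

if __name__ == "__main__":
    for kind, run in script_runs("សាលា √25 + 3² = 34 ok"):
        print(kind, run)
```

Runs like these make it easy to check which parts of a mixed input are Khmer words, which are numerals, and which are symbols that might be covered by user-defined tokens.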
🧩 Example usage
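    from transformers import T5Tokenizer

    # Load the tokenizer from the Hugging Face Hub
    tok = T5Tokenizer.from_pretrained("Msok99/km-bpe-md-v3-spm")
    # Mixed Khmer + math sample input
    text = "ខ្ញុំបានទៅសាលា √25 + 3² = 34"
    print(tok.tokenize(text))            # subword pieces
    print(tok.decode(tok.encode(text)))  # round trip back to text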
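The encode/decode round trip in the usage example can be turned into a quick check. The helper below is a minimal sketch, assuming a tokenizer object with Hugging Face style `encode(text)` and `decode(ids, skip_special_tokens=True)` methods. `roundtrip_ok` and `CodePointTokenizer` are hypothetical names; the latter is a stand-in so the block runs without downloading a model.

```python
import unicodedata

def _canon(s: str) -> str:
    """NFC-normalize and collapse whitespace so cosmetic differences are ignored."""
    return " ".join(unicodedata.normalize("NFC", s).split())

def roundtrip_ok(tok, text: str) -> bool:
    """Return True if decode(encode(text)) reproduces text.

    Special tokens are skipped on decode, since encode() may append
    markers such as an end-of-sequence token.
    """
    ids = tok.encode(text)
    decoded = tok.decode(ids, skip_special_tokens=True)
    return _canon(decoded) == _canon(text)

class CodePointTokenizer:
    """Stand-in with the same method shape; NOT the real tokenizer."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids, skip_special_tokens=True):
        return "".join(chr(i) for i in ids)

if __name__ == "__main__":
    print(roundtrip_ok(CodePointTokenizer(), "3² = 9"))  # True
```

With the real model, pass the `T5Tokenizer` instance in place of the stand-in. A `False` result usually points to normalization or unknown-token handling changing the text.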
Training details
- Base: Khmer Morphology-Driven corpus (education, news, QA)
- Algorithm: SentencePiece (BPE mode)
- User symbols: Mathematical, scientific, and Khmer-digit patterns
- Goal: Robust tokenization for LLM fine-tuning on Khmer + mixed-script data
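The settings listed in this card map onto SentencePiece trainer options roughly as follows. This is a sketch under stated assumptions, not the author's training script: the corpus path and model prefix are placeholders, the symbol list is only the subset named in this card, `byte_fallback=True` is a guess at what "byte safety" refers to, and the input is assumed to be text already segmented by the BPE-MD rules, one sentence per line.

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options matching the card's stated settings."""
    return {
        "input": corpus_path,            # placeholder path (assumption)
        "model_prefix": model_prefix,    # placeholder prefix (assumption)
        "model_type": "bpe",             # card: SentencePiece in BPE mode
        "vocab_size": 16100,             # card: vocab size 16 100
        "character_coverage": 1.0,       # card: character coverage 1.0
        # Subset of user symbols named in the card; the full list is not given.
        "user_defined_symbols": ["√", "²", "H₂O", "log₁₀"],
        "byte_fallback": True,           # assumption, see lead-in
    }

def train(corpus_path: str, model_prefix: str) -> None:
    """Run training; requires the `sentencepiece` package and a real corpus."""
    import sentencepiece as spm  # imported lazily so the config can be inspected without it
    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))

if __name__ == "__main__":
    # Only prints the options; call train(...) with a real corpus to build a model.
    print(build_trainer_args("corpus.txt", "km_bpe_md"))
```

Keeping the options in a plain dictionary makes it easy to compare them against the card before starting a training run.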
License
MIT