🇰🇭 Khmer BPE-MD-v3-SPM (Hybrid Tokenizer)

BPE-MD-v3-SPM is a hybrid Khmer tokenizer that combines:

  • BPE-MD (Morphology-Driven) rules for Khmer word segmentation, and
  • SentencePiece BPE modeling for subword learning, coverage, and byte safety.

This tokenizer is built for both Khmer-only and mixed (Khmer + English + math) text. It applies Unicode normalization and handles symbols and numerals, which makes it suitable for LLMs, translation models, and RAG systems.
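
The normalization form the tokenizer applies is not stated here. As a minimal sketch using only Python's standard `unicodedata` module, the snippet below shows why the choice matters for math text: compatibility normalization (NFKC) folds superscripts and subscripts into plain digits, while canonical normalization (NFC) leaves them intact. The sample strings and the helper name `compare_forms` are illustrative only.

```python
import unicodedata

# Illustrative samples containing superscripts, subscripts, and a math operator.
samples = ["3²", "H₂O", "log₁₀", "√25"]

def compare_forms(text):
    """Return the (NFC, NFKC) normalized forms of text."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )

for s in samples:
    nfc, nfkc = compare_forms(s)
    print(f"{s!r}: NFC={nfc!r} NFKC={nfkc!r}")
```

If text were folded with NFKC before reaching the tokenizer, dedicated symbols such as `²` or `₁₀` would already have been rewritten as plain digits, so it is worth checking any preprocessing step applied upstream.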


🧠 Features

  • Hybrid design: BPE-MD (morphology) × SentencePiece (subword)
  • Script coverage: Khmer + Latin + Math + Digits
  • Vocab size: 16,100
  • Character coverage: 1.0
  • Includes: user-defined math and chemical tokens (√, ², ₁₀, H₂O, log₁₀, etc.)

🧩 Example usage

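    from transformers import T5Tokenizer

    # Load the tokenizer from the Hugging Face Hub.
    tok = T5Tokenizer.from_pretrained("Msok99/km-bpe-md-v3-spm")

    # Mixed Khmer + math sample text.
    text = "ខ្ញុំបានគណនាថា √25 + 3² = 34"

    # Show the subword pieces.
    print(tok.tokenize(text))

    # Round trip: encode to IDs, then decode without special tokens such as </s>.
    print(tok.decode(tok.encode(text), skip_special_tokens=True))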
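
To check the encode/decode round trip over several inputs, a small helper along these lines could be used. This is a sketch: `round_trip_report` is a hypothetical name, and it assumes only that the tokenizer object exposes `encode` and `decode(..., skip_special_tokens=True)` as Hugging Face tokenizers do. SentencePiece-based tokenizers may alter whitespace or normalize characters, so a mismatch is not necessarily a bug.

```python
def round_trip_report(tokenizer, texts):
    """Encode and decode each text; return (original, decoded) pairs that differ."""
    mismatches = []
    for text in texts:
        ids = tokenizer.encode(text)
        decoded = tokenizer.decode(ids, skip_special_tokens=True)
        if decoded != text:
            mismatches.append((text, decoded))
    return mismatches

# Example (needs network access to download the tokenizer):
# from transformers import T5Tokenizer
# tok = T5Tokenizer.from_pretrained("Msok99/km-bpe-md-v3-spm")
# for original, decoded in round_trip_report(tok, ["√25 + 3² = 34", "H₂O"]):
#     print(repr(original), "->", repr(decoded))
```

An empty result means every input survived unchanged; any returned pair shows exactly where the decoded text diverged.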


📊 Training details

  • Base: Khmer Morphology-Driven corpus (education, news, QA)
  • Algorithm: SentencePiece (BPE mode)
  • User symbols: Mathematical, scientific, and Khmer-digit patterns
  • Goal: Robust tokenization for LLM fine-tuning on Khmer + mixed-script data
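
The settings listed above map onto options of the `sentencepiece` Python trainer. The following is a sketch of how such a model might be trained, not the author's actual configuration: the corpus path, the model prefix, the exact symbol list, and the use of `byte_fallback` are assumptions, and the morphology-driven segmentation is assumed to have been applied to the corpus beforehand.

```python
def build_trainer_args(corpus_path, model_prefix="km_bpe_md_sketch"):
    """Collect SentencePiece trainer options mirroring the documented settings."""
    return {
        "input": corpus_path,          # hypothetical corpus file, one sentence per line
        "model_prefix": model_prefix,  # hypothetical output name
        "model_type": "bpe",           # SentencePiece in BPE mode
        "vocab_size": 16100,           # vocab size listed under Features
        "character_coverage": 1.0,     # character coverage listed under Features
        # Assumed subset of the user-defined math and chemical tokens.
        "user_defined_symbols": ["√", "²", "₁₀", "H₂O", "log₁₀"],
        "byte_fallback": True,         # assumption: one way to get byte-level safety
    }

def train(corpus_path):
    """Train a model with the options above (requires the sentencepiece package)."""
    import sentencepiece as spm  # imported lazily so the options can be inspected without it
    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path))
```

Calling `train("corpus.txt")` would write `km_bpe_md_sketch.model` and `km_bpe_md_sketch.vocab` next to the script; the options dictionary can be inspected or adjusted first without installing `sentencepiece`.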

📜 License

MIT
