# Extended mT5 Tokenizer with Dhivehi Support
This tokenizer extends [`google/mt5-base`](https://huggingface.co/google/mt5-base) with full Dhivehi language support while preserving the original multilingual capabilities of mT5. It is based on a 300k-vocabulary SentencePiece model trained on a combined corpus of multilingual samples and Dhivehi text.
## Overview
- **Base model:** `google/mt5-base`
- **Tokenizer type:** SentencePiece (unigram)
- **Vocab size:** 300,000
- **Byte fallback:** ✅ Enabled
- **Normalization:** NFKC
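Byte fallback means that any character not covered by the learned vocabulary is decomposed into its UTF-8 bytes, each mapped to a piece of the form `<0xXX>`, so no input ever degrades to `<unk>`. A minimal stdlib-only sketch of that convention (illustrating the piece format, not the actual tokenizer):

```python
# Sketch of SentencePiece's byte-fallback convention: an out-of-vocabulary
# character is split into its UTF-8 bytes, each rendered as a <0xXX> piece.
def byte_fallback_pieces(char: str) -> list[str]:
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]

# The Dhivehi letter "ދ" (U+078B) is two bytes in UTF-8.
print(byte_fallback_pieces("ދ"))  # ['<0xDE>', '<0x8B>']
```

With Dhivehi pieces now in the vocabulary, the extended tokenizer should rarely need this path for Thaana script, but it guarantees lossless coverage for anything outside the 300k entries.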
## Training Details
| Detail | Value |
|---|---|
| Tokenizer | SentencePieceTrainer |
| Model type | unigram |
| Vocab size | 300,000 |
| Character coverage | 0.9995 |
| Input sentence size | 5,000,000 |
| Special tokens | `<pad>`, `<unk>`, `<s>`, `</s>` |
| Byte fallback | Enabled |
| Normalization | NFKC |
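The settings in the table correspond to a SentencePiece training invocation roughly like the sketch below. The corpus path, model prefix, and special-token ID assignments are assumptions for illustration, not taken from the original training run:

```python
import sentencepiece as spm

# Hypothetical reconstruction of the training call from the table above.
# "corpus.txt" and "spiece" are placeholders for the real paths.
spm.SentencePieceTrainer.train(
    input="corpus.txt",              # multilingual + Dhivehi text, one sentence per line
    model_prefix="spiece",
    model_type="unigram",
    vocab_size=300_000,
    character_coverage=0.9995,
    input_sentence_size=5_000_000,
    shuffle_input_sentence=True,
    byte_fallback=True,
    normalization_rule_name="nfkc",
    # Assumed IDs following the T5/mT5 convention: <pad>=0, </s>=1, <unk>=2, no <s>.
    pad_id=0, eos_id=1, unk_id=2, bos_id=-1,
)
```

Note that training at this scale requires a large corpus and substantial memory; the sketch is a configuration reference, not a ready-to-run recipe.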
## How to Use
```python
from transformers import MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("alakxender/mt5-dhivehi-tokenizer-extended")

text = "ރިޔާސީ އިންތިހާބުގައި ވާދަކުރަށްވަން ނަޝީދު ހިޔާލު ހޯއްދަވަނީ"
tokens = tokenizer.tokenize(text)
print(tokens)
```
## Improvements
| Input | Base mT5 tokenizer | This tokenizer |
|---|---|---|
| Dhivehi | Fragmented | Improved |
| Multilingual | Supported | Retained |
| Mixed sentences | Fragile | Improved |
## Round-Trip Decoding
The tokenizer supports round-trip decoding for Dhivehi:
```python
ids = tokenizer.encode("ރިޔާސީ", add_special_tokens=False)
decoded = tokenizer.decode(ids)
assert decoded == "ރިޔާސީ"
```
## Files

- `spiece.model`: the new SentencePiece model (300k vocab)
- `tokenizer_config.json`: updated for the extended vocab
- `special_tokens_map.json`: preserved from `mt5-base`
## Notes

- This tokenizer is designed for compatibility with the `google/mt5-base` model.
- If you want to fine-tune mT5 using this tokenizer, make sure to resize the model embeddings first:

```python
from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
model.resize_token_embeddings(len(tokenizer))
```