dhivehi-mt5-tokenizer

Extended mT5 Tokenizer with Dhivehi Support

This tokenizer extends google/mt5-base with full Dhivehi language support while preserving mT5's original multilingual capabilities.

It is based on a 300,000-entry SentencePiece model trained on a combined corpus of multilingual samples and Dhivehi text.

Overview

  • Base model: google/mt5-base
  • Tokenizer type: SentencePiece (unigram)
  • Vocab size: 300,000
  • Byte fallback: ✅ Enabled
  • Normalization: NFKC
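
With byte fallback enabled, a character that still falls outside the 300k vocabulary is decomposed into its raw UTF-8 bytes (pieces like `<0xDE>`) instead of collapsing to `<unk>`. A minimal sketch of the idea (an illustration, not SentencePiece's actual implementation):

```python
def byte_fallback_pieces(text: str) -> list[str]:
    """Represent out-of-vocabulary text as SentencePiece-style byte pieces.

    Each UTF-8 byte becomes a piece of the form <0xNN>, mirroring what
    byte fallback emits for characters missing from the vocabulary.
    """
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

# The Thaana letter Dhaalu (U+078B) is two bytes in UTF-8:
print(byte_fallback_pieces("ދ"))  # ['<0xDE>', '<0x8B>']
```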

Training Details

  • Tokenizer: SentencePieceTrainer
  • Model type: unigram
  • Vocab size: 300,000
  • Character coverage: 0.9995
  • Input sentence size: 5,000,000
  • Special tokens: <pad>, <unk>, <s>, </s>
  • Byte fallback: Enabled
  • Normalization: NFKC

How to Use

from transformers import MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("alakxender/mt5-dhivehi-tokenizer-extended")

text = "ރިޔާސީ އިންތިހާބުގައި ވާދަކުރަށްވަން ނަޝީދު ހިޔާލު ހޯއްދަވަނީ"
tokens = tokenizer.tokenize(text)
print(tokens)

Improvements

  • Dhivehi: fragmented by the base tokenizer; improved
  • Multilingual text: supported as before; retained
  • Mixed sentences: fragile with the base tokenizer; improved

Round-Trip Decoding

The tokenizer supports round-trip decoding for Dhivehi:

ids = tokenizer.encode("ރިޔާސީ", add_special_tokens=False)
decoded = tokenizer.decode(ids)
assert decoded == "ރިޔާސީ"
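
One caveat (this follows from the NFKC setting above rather than being stated in the original card): because input is NFKC-normalized before tokenization, exact round-trip equality is only guaranteed for text that is already in NFKC form. Characters with compatibility decompositions come back normalized:

```python
import unicodedata

# The ligature "ﬁ" (U+FB01) is not NFKC-normal; a tokenizer that applies
# NFKC normalization will decode it back as the two letters "fi".
raw = "ﬁle"
normalized = unicodedata.normalize("NFKC", raw)
print(normalized)         # "file"
print(raw == normalized)  # False
```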

Files

  • spiece.model: The new SentencePiece model (300k vocab)
  • tokenizer_config.json: Updated for extended vocab
  • special_tokens_map.json: Preserved from mt5-base

Notes

  • This tokenizer is designed for compatibility with the google/mt5-base model.
  • If you want to fine-tune mT5 using this tokenizer, make sure to resize the model embeddings:
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("alakxender/mt5-dhivehi-tokenizer-extended")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
# Grow the embedding (and tied output) matrices to the extended 300k vocab
model.resize_token_embeddings(len(tokenizer))