BERT Tokenizer Extended for Dhivehi

This is a BertTokenizerFast built by extending the original bert-base-multilingual-cased tokenizer with wordpieces learned from a large Dhivehi corpus. It retains full compatibility with English and other languages while adding WordPiece-level support for the Dhivehi script.

How to Use

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

text = "ދިވެހި މަޅި ރޯކުރަނީ The quick brown fox"
tokens = tokenizer.tokenize(text)
print(tokens)

Output:

['ދިވެހި', 'މަޅި', 'ރޯ', '##ކުރަނީ', 'The', 'quick', 'brown', 'f', '##ox']
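
Because this is a standard BertTokenizerFast, the usual encode/decode calls work unchanged. The sketch below (reusing the tokenizer and text loaded above) shows a full encoding with special tokens and a decode back to text:

# Full encoding adds [CLS] and [SEP] and returns ids plus attention mask
encoding = tokenizer(text)
print(encoding["input_ids"])
print(encoding["attention_mask"])

# Decoding maps the ids back to readable text
print(tokenizer.decode(encoding["input_ids"], skip_special_tokens=True))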

Tokenizer Details

  • Base model: bert-base-multilingual-cased
  • Type: BertTokenizerFast
  • Vocab size: 150,000
  • Trained on: Cleaned Dhivehi monolingual corpus
  • Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
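
A quick way to confirm these details locally is the sketch below; the exact printed values depend on the published checkpoint:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

# Total vocabulary size, including the added Dhivehi wordpieces
print(len(tokenizer))

# Special tokens carried over from BERT ([PAD], [UNK], [CLS], [SEP], [MASK])
print(tokenizer.special_tokens_map)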

Tokenization Comparison

Language    Stock BERT    Extended Tokenizer
English     Perfect       Perfect
Dhivehi     UNKs          Full Coverage
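
The difference is easy to reproduce. The following sketch (an illustrative comparison, not a published benchmark) tokenizes the same Dhivehi sentence with the stock multilingual tokenizer and with the extended one; the stock tokenizer has no Thaana wordpieces, so Dhivehi text collapses to [UNK]:

from transformers import BertTokenizerFast

stock = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
extended = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

text = "ދިވެހި މަޅި ރޯކުރަނީ"

# Stock mBERT: Dhivehi becomes [UNK] tokens
print(stock.tokenize(text))

# Extended tokenizer: real Dhivehi wordpieces
print(extended.tokenize(text))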

Clean Vocabulary

All added tokens are frequent in the training corpus (minimum frequency ≥ 5), and unused English tokens from the base vocabulary are preserved to avoid collisions.
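
Because the extended vocabulary is larger than the one bert-base-multilingual-cased was trained with, a model paired with this tokenizer needs its embedding matrix resized before fine-tuning. A minimal sketch follows; the choice of base checkpoint and head is an assumption, and any BERT-style model works the same way:

from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Grow the input (and tied output) embeddings to match the extended vocab;
# the new rows are randomly initialized and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))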
