BERT Tokenizer Extended for Dhivehi

This is a BertTokenizerFast built by extending the original bert-base-multilingual-cased tokenizer with wordpieces learned from a large Dhivehi corpus. It retains full compatibility with English and other languages while adding WordPiece-level support for the Dhivehi script.

How to Use

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

text = "ދިވެހި މަޅި ރޯކުރަނީ The quick brown fox"
tokens = tokenizer.tokenize(text)
print(tokens)

Output:

['ދިވެހި', 'މަޅި', 'ރޯ', '##ކުރަނީ', 'The', 'quick', 'brown', 'f', '##ox']
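
Because this is a standard BertTokenizerFast, the usual encode/decode calls work unchanged. The sketch below (reusing the tokenizer and text loaded above) shows a full encoding with special tokens and a decode back to text:

# Full encoding adds [CLS] and [SEP] and returns ids plus attention mask
encoding = tokenizer(text)
print(encoding["input_ids"])
print(encoding["attention_mask"])

# Decoding maps the ids back to readable text
print(tokenizer.decode(encoding["input_ids"], skip_special_tokens=True))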

Tokenizer Details

  • Base model: bert-base-multilingual-cased
  • Type: BertTokenizerFast
  • Vocab size: 150,000
  • Trained on: Cleaned Dhivehi monolingual corpus
  • Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
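
A quick way to confirm these details locally is the sketch below; the exact printed values depend on the published checkpoint:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

# Total vocabulary size, including the added Dhivehi wordpieces
print(len(tokenizer))

# Special tokens carried over from BERT ([PAD], [UNK], [CLS], [SEP], [MASK])
print(tokenizer.special_tokens_map)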

Tokenization Comparison

Language    Stock BERT    Extended Tokenizer
English     Perfect       Perfect
Dhivehi     UNKs          Full Coverage
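
The difference is easy to reproduce. The following sketch (an illustrative comparison, not a published benchmark) tokenizes the same Dhivehi sentence with the stock multilingual tokenizer and with the extended one; the stock tokenizer has no Thaana wordpieces, so Dhivehi text collapses to [UNK]:

from transformers import BertTokenizerFast

stock = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
extended = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

text = "ދިވެހި މަޅި ރޯކުރަނީ"

# Stock mBERT: Dhivehi becomes [UNK] tokens
print(stock.tokenize(text))

# Extended tokenizer: real Dhivehi wordpieces
print(extended.tokenize(text))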

Clean Vocabulary

All added tokens are frequent in the training corpus (minimum frequency ≥ 5), and unused English tokens from the base vocabulary are preserved to avoid collisions.
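
Because the extended vocabulary is larger than the one bert-base-multilingual-cased was trained with, a model paired with this tokenizer needs its embedding matrix resized before fine-tuning. A minimal sketch follows; the choice of base checkpoint and head is an assumption, and any BERT-style model works the same way:

from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Grow the input (and tied output) embeddings to match the extended vocab;
# the new rows are randomly initialized and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))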
