# BERT Tokenizer Extended for Dhivehi
This is a `BertTokenizerFast` built by extending the original `bert-base-multilingual-cased` tokenizer with a large Dhivehi corpus. It retains full compatibility with English and the other languages covered by the base model while adding wordpiece-level support for the Dhivehi (Thaana) script.
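
The exact extension procedure is not documented in this card, but the sketch below shows one way such a vocabulary extension could be reproduced: train a Dhivehi-only WordPiece vocabulary, then append the new pieces to the base vocab file. The corpus path, intermediate vocab size, and output directory are placeholders.

```python
from transformers import BertTokenizerFast

# Train a Dhivehi-only WordPiece vocabulary from the base tokenizer's settings
# (dhivehi_corpus.txt is a placeholder for the cleaned monolingual corpus).
base = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
with open("dhivehi_corpus.txt", encoding="utf-8") as f:
    corpus = f.readlines()
dv = base.train_new_from_iterator(corpus, vocab_size=32_000)

# Collect wordpieces that the base vocabulary does not already contain.
base_vocab = base.get_vocab()
new_pieces = [piece for piece in dv.get_vocab() if piece not in base_vocab]

# Append them to the base vocab file and rebuild a fast tokenizer from it.
base.save_pretrained("bert-dhivehi-extended")  # writes vocab.txt
with open("bert-dhivehi-extended/vocab.txt", "a", encoding="utf-8") as f:
    for piece in new_pieces:
        f.write(piece + "\n")

extended = BertTokenizerFast(
    vocab_file="bert-dhivehi-extended/vocab.txt",
    do_lower_case=False,  # keep the cased behaviour of the base model
)
extended.save_pretrained("bert-dhivehi-extended")
```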
## How to Use

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

text = "ދިވެހި މަޅި ރޯކުރަނީ The quick brown fox"
tokens = tokenizer.tokenize(text)
print(tokens)
```
Output:

```text
['ދިވެހި', 'މަޅި', 'ރޯ', '##ކުރަނީ', 'The', 'quick', 'brown', 'f', '##ox']
```
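
The IDs also round-trip cleanly. The check below assumes the `tokenizer` and `text` from the snippet above and simply verifies that no piece falls back to `[UNK]`:

```python
# Encode to input IDs, confirm no wordpiece mapped to [UNK], and decode back.
ids = tokenizer(text)["input_ids"]
print(tokenizer.unk_token_id in ids)                    # expected: False
print(tokenizer.decode(ids, skip_special_tokens=True))  # the mixed Dhivehi/English string
```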
## Tokenizer Details

- Base model: `bert-base-multilingual-cased`
- Type: `BertTokenizerFast`
- Vocab size: 150,000
- Trained on: cleaned Dhivehi monolingual corpus
- Special tokens: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
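
A quick sanity check of these values, assuming the tokenizer loaded above (note that `len(tokenizer)` also counts any added tokens):

```python
# Vocabulary size and the special tokens listed above.
print(len(tokenizer))                # expected to match the 150,000 reported here
print(tokenizer.all_special_tokens)  # [PAD], [UNK], [CLS], [SEP], [MASK] (order may vary)
```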
## Tokenization Comparison

| Language | Stock BERT | Extended Tokenizer |
|----------|------------|--------------------|
| English  | Perfect    | Perfect            |
| Dhivehi  | UNKs       | Full coverage      |
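
The comparison can be reproduced by tokenizing the same Dhivehi text with the stock multilingual tokenizer and with this one; a minimal sketch:

```python
from transformers import BertTokenizerFast

stock = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
extended = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")

text = "ދިވެހި މަޅި ރޯކުރަނީ"
print(stock.tokenize(text))     # Thaana text largely falls back to [UNK]
print(extended.tokenize(text))  # Dhivehi wordpieces from the extended vocabulary
```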
## Clean Vocabulary

All added tokens are frequent in the training corpus (minimum frequency ≥ 5), and unused English tokens from the base vocabulary are preserved to avoid ID collisions.
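
Because the 150,000-entry vocabulary is larger than the one the base checkpoint was trained with, a model paired with this tokenizer needs its embedding matrix resized; a minimal sketch, assuming `bert-base-multilingual-cased` weights as the starting point:

```python
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("alakxender/bert-dhivehi-tokenizer-extended")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Rows added for the new Dhivehi wordpieces are randomly initialised and
# need further pretraining or fine-tuning on Dhivehi text to be useful.
model.resize_token_embeddings(len(tokenizer))
```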