DictaBERT-large-char-menaked: An open-source BERT-based model for adding diacritization marks ("nikud") to Hebrew texts
This model is a fine-tuned version of DictaBERT-large-char, dedicated to the task of adding nikud (diacritics) to Hebrew text.
The model was trained on a corpus of modern Hebrew texts manually diacritized by linguistic experts. As of March 2025, it achieves state-of-the-art performance on all modern Hebrew vocalization benchmarks, both against all other open-source alternatives and against commercial generative LLMs.
Note: this model is trained to handle a wide variety of genres of modern Hebrew prose. However, it is not intended for earlier layers of Hebrew (e.g. Biblical, Rabbinic, Premodern), nor for poetic texts.
Sample usage:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-large-char-menaked')
model = AutoModel.from_pretrained('dicta-il/dictabert-large-char-menaked', trust_remote_code=True)
model.eval()
sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
print(model.predict([sentence], tokenizer))
Output:
['בִּשְׁנַת 1948 הִשְׁלִים אֶפְרַיִם קִישׁוֹן אֶת לִמּוּדָיו בְּפִסּוּל מַתֶּכֶת וּבְתוֹלְדוֹת הָאׇמָּנוּת וְהֵחֵל לְפַרְסֵם מַאֲמָרִים הוּמוֹרִיסְטִיִּים']
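To compare a prediction with its undiacritized input, it can help to strip the nikud again. The sketch below is not part of the model's API: `strip_nikud` is a hypothetical helper that removes every Unicode nonspacing mark (category `Mn`), which covers the Hebrew vowel points, dagesh, and shin/sin dots, but also any other combining marks in the text. Because `predict` may drop some letters by default (see the section on matres lectionis), the stripped output is not guaranteed to equal the input.

```python
import unicodedata


def strip_nikud(text: str) -> str:
    """Return text with all nonspacing combining marks removed.

    Hypothetical helper: it filters on the Unicode category "Mn",
    so it removes Hebrew points and any other combining marks too.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# First word of the sample output, written with escapes to avoid
# any ambiguity in the order of the combining marks.
diacritized = "\u05d1\u05bc\u05b4\u05e9\u05c1\u05b0\u05e0\u05b7\u05ea 1948"
print(strip_nikud(diacritized))
```

Digits, spaces, and punctuation pass through unchanged, since none of them belong to category `Mn`.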
Matres Lectionis (אימות קריאה)
As the output shows, the predict method removes all matres lectionis (אימות קריאה) by default. To keep them, pass a marker character through the mark_matres_lectionis argument of predict; each retained letter is then flagged with that marker:
print(model.predict([sentence], tokenizer, mark_matres_lectionis='*'))
Output:
['בִּשְׁנַת 1948 הִשְׁלִים אֶפְרַיִם קִישׁוֹן אֶת לִי*מּוּדָיו בְּפִי*סּוּל מַתֶּכֶת וּבְתוֹלְדוֹת הָאׇמָּנוּת וְהֵחֵל לְפַרְסֵם מַאֲמָרִים הוּמוֹרִיסְטִיִּים']
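Marked output can be post-processed into either spelling with plain string operations. The following is a minimal sketch, not part of the model's API: both function names are hypothetical, and the code assumes that the marker sits immediately after the letter it flags, as it does in the sample output above. It also assumes the marker character does not otherwise occur in the text.

```python
def keep_matres_lectionis(text: str, marker: str = "*") -> str:
    """Keep the flagged letters and delete only the marker itself."""
    return text.replace(marker, "")


def drop_matres_lectionis(text: str, marker: str = "*") -> str:
    """Delete each marker together with the single character before it.

    Assumption: the marker directly follows the flagged letter, with no
    diacritic in between.
    """
    out = []
    for ch in text:
        if ch == marker:
            if out:
                out.pop()  # remove the flagged letter
        else:
            out.append(ch)
    return "".join(out)


# Two words taken from the marked sample output.
marked = "לִי*מּוּדָיו בְּפִי*סּוּל"
print(keep_matres_lectionis(marked))  # full spelling, with nikud
print(drop_matres_lectionis(marked))  # spelling without the flagged letters
```

Choosing a marker that cannot appear in the input (for example a rarely used symbol) keeps this post-processing unambiguous.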
License
This work is licensed under a Creative Commons Attribution 4.0 International License.