DictaBERT-large-char-menaked: An open-source BERT-based model for adding diacritization marks ("nikud") to Hebrew text

This model is a fine-tuned version of DictaBERT-large-char, dedicated to the task of adding nikud (diacritics) to Hebrew text.

The model was trained on a corpus of modern Hebrew texts manually diacritized by linguistic experts. As of March 2025, this model provides state-of-the-art performance on all modern Hebrew vocalization benchmarks, outperforming all other open-source alternatives as well as commercial generative LLMs.

Note: this model is trained to handle a wide variety of genres of modern Hebrew prose. However, it is not intended for earlier layers of Hebrew (e.g. Biblical, Rabbinic, Premodern), nor for poetic texts.

Sample usage:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-large-char-menaked')
# trust_remote_code=True is required: the predict method is defined in the model repository
model = AutoModel.from_pretrained('dicta-il/dictabert-large-char-menaked', trust_remote_code=True)

model.eval()

sentence = 'ื‘ืฉื ืช 1948 ื”ืฉืœื™ื ืืคืจื™ื ืงื™ืฉื•ืŸ ืืช ืœื™ืžื•ื“ื™ื• ื‘ืคื™ืกื•ืœ ืžืชื›ืช ื•ื‘ืชื•ืœื“ื•ืช ื”ืืžื ื•ืช ื•ื”ื—ืœ ืœืคืจืกื ืžืืžืจื™ื ื”ื•ืžื•ืจื™ืกื˜ื™ื™ื'
print(model.predict([sentence], tokenizer))

Output:

['ื‘ึผึดืฉืึฐื ึทืช 1948 ื”ึดืฉืึฐืœึดื™ื ืึถืคึฐืจึทื™ึดื ืงึดื™ืฉืื•ึนืŸ ืึถืช ืœึดืžึผื•ึผื“ึธื™ื• ื‘ึผึฐืคึดืกึผื•ึผืœ ืžึทืชึผึถื›ึถืช ื•ึผื‘ึฐืชื•ึนืœึฐื“ื•ึนืช ื”ึธืื‡ืžึผึธื ื•ึผืช ื•ึฐื”ึตื—ึตืœ ืœึฐืคึทืจึฐืกึตื ืžึทืึฒืžึธืจึดื™ื ื”ื•ึผืžื•ึนืจึดื™ืกึฐื˜ึดื™ึผึดื™ื']

Matres Lectionis (ืื™ืžื•ืช ืงืจื™ืื”)

As can be seen, the predict method removes all matres lectionis (ืื™ืžื•ืช ืงืจื™ืื”) from the vocalized output by default. If you wish to keep them, pass the mark_matres_lectionis parameter to the predict function with the marker character you want to use:

print(model.predict([sentence], tokenizer, mark_matres_lectionis='*'))

Output:

['ื‘ึผึดืฉืึฐื ึทืช 1948 ื”ึดืฉืึฐืœึดื™ื ืึถืคึฐืจึทื™ึดื ืงึดื™ืฉืื•ึนืŸ ืึถืช ืœึดื™*ืžึผื•ึผื“ึธื™ื• ื‘ึผึฐืคึดื™*ืกึผื•ึผืœ ืžึทืชึผึถื›ึถืช ื•ึผื‘ึฐืชื•ึนืœึฐื“ื•ึนืช ื”ึธืื‡ืžึผึธื ื•ึผืช ื•ึฐื”ึตื—ึตืœ ืœึฐืคึทืจึฐืกึตื ืžึทืึฒืžึธืจึดื™ื ื”ื•ึผืžื•ึนืจึดื™ืกึฐื˜ึดื™ึผึดื™ื']

License


This work is licensed under a Creative Commons Attribution 4.0 International License.


Model size: 305M parameters (Safetensors, F32 tensors)