metadata
license: cc-by-4.0
language:
  - he
inference: false

DictaBERT-large-char-menaked: An open-source BERT-based model for adding diacritization marks ("nikud") to Hebrew texts

This model is a fine-tuned version of DictaBERT-large-char, dedicated to the task of adding nikud (diacritics) to Hebrew text.

The model was trained on a corpus of modern Hebrew texts manually diacritized by linguistic experts. As of March 2025, it achieves state-of-the-art results on all modern Hebrew vocalization benchmarks, outperforming both the other open-source alternatives and commercial generative LLMs.

Note: this model is trained to handle a wide variety of genres of modern Hebrew prose. However, it is not intended for earlier layers of Hebrew (e.g. Biblical, Rabbinic, Premodern), nor for poetic texts.

Sample usage:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-large-char-menaked')
model = AutoModel.from_pretrained('dicta-il/dictabert-large-char-menaked', trust_remote_code=True)

model.eval()

sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
print(model.predict([sentence], tokenizer))

Output:

['בִּשְׁנַת 1948 הִשְׁלִים אֶפְרַיִם קִישׁוֹן אֶת לִמּוּדָיו בְּפִסּוּל מַתֶּכֶת וּבְתוֹלְדוֹת הָאׇמָּנוּת וְהֵחֵל לְפַרְסֵם מַאֲמָרִים הוּמוֹרִיסְטִיִּים']
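
When comparing the diacritized output with the original input, it can help to strip the nikud again. The following is a minimal sketch using only the Python standard library; strip_nikud is a hypothetical helper introduced here for illustration and is not part of the model's API. It removes the nonspacing marks in the Hebrew Unicode block (points and cantillation marks) and leaves letters, digits and punctuation such as the maqaf untouched.

```python
import unicodedata


def strip_nikud(text: str) -> str:
    """Remove Hebrew points and cantillation marks (hypothetical helper).

    Only characters in U+0591..U+05C7 whose Unicode category is "Mn"
    (nonspacing mark) are dropped, so the maqaf (U+05BE) and other
    Hebrew punctuation in that range are kept.
    """
    return "".join(
        ch
        for ch in text
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    )


# Example: the diacritized word for "et" loses its segol.
print(strip_nikud("אֶת"))
```

Because predict removes matres lectionis by default, the stripped output can be shorter than the input: in the sample above, לימודיו in the input corresponds to למודיו after stripping.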

Matres Lectionis (אימות קריאה)

As the output above shows, the predict method removes all matres lectionis (אימות קריאה) by default. To keep them, pass the mark_matres_lectionis argument to predict; in the example below, each retained letter is followed by the marker '*':

print(model.predict([sentence], tokenizer, mark_matres_lectionis='*'))

Output:

['בִּשְׁנַת 1948 הִשְׁלִים אֶפְרַיִם קִישׁוֹן אֶת לִי*מּוּדָיו בְּפִי*סּוּל מַתֶּכֶת וּבְתוֹלְדוֹת הָאׇמָּנוּת וְהֵחֵל לְפַרְסֵם מַאֲמָרִים הוּמוֹרִיסְטִיִּים']
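
The marked output can be post-processed into either spelling. The sketch below assumes, based only on the sample output above, that the marker immediately follows the mater lectionis letter; keep_matres and drop_matres are hypothetical helper names, not functions provided by the model.

```python
import re

# A single Hebrew letter (alef through tav).
HEBREW_LETTER = "[\u05d0-\u05ea]"


def keep_matres(text: str, marker: str = "*") -> str:
    """Full spelling: remove only the marker, keep the marked letters."""
    return text.replace(marker, "")


def drop_matres(text: str, marker: str = "*") -> str:
    """Defective spelling: remove each marked letter together with its marker.

    Assumes the marker directly follows the letter it marks.
    """
    return re.sub(HEBREW_LETTER + re.escape(marker), "", text)


marked = "לִי*מּוּדָיו"
print(keep_matres(marked))  # yod kept, marker removed
print(drop_matres(marked))  # yod and marker both removed
```

keep_matres removes every occurrence of the marker, so it is safest to choose a marker string that cannot occur in the input text itself.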

Community Project

A third-party project, dicta-onnx, offers a lightweight ONNX-based tool built on top of our model for adding Hebrew diacritics. We're not affiliated, but it's a cool and practical application you might find useful.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.