DictaBERT-char: A Character-Level BERT-Base model for Hebrew.

DictaBERT-char is a BERT-style language model for Hebrew, based on the BERT-base architecture with a character level tokenizer. The model based on the BERT-Large architecture is available here.

This model is released to the public in this 2025 W-NUT paper: Avi Shmidman and Shaltiel Shmidman, "Restoring Missing Spaces in Scraped Hebrew Social Media", The 10th Workshop on Noisy and User-generated Text (W-NUT), 2025

This is the base model pretrained with the masked-language-modeling objective.

Sample usage:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-char')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/dictabert-char')

model.eval()

sentence = 'בשנת 1948 השלים אפרים קישון את מחקרו בפיסול מתכת וב[MASK]ולדות האמנות והחל לפרסם מאמרים הומוריסטיים'

output = model(tokenizer.encode(sentence, return_tensors='pt'))
# the [MASK] is the 52nd token (including [CLS])
import torch
top_arg = torch.argmax(output.logits[0, 52, :])
print(tokenizer.convert_ids_to_tokens([top_arg])) # should print ['ת']

Citation

If you use DictaBERT-char in your research, please cite Restoring Missing Spaces in Scraped Hebrew Social Media

BibTeX:

@inproceedings{shmidman-shmidman-2025-restoring,
    title = "Restoring Missing Spaces in Scraped {H}ebrew Social Media",
    author = "Shmidman, Avi  and
      Shmidman, Shaltiel",
    editor = "Bak, JinYeong  and
      Goot, Rob van der  and
      Jang, Hyeju  and
      Buaphet, Weerayut  and
      Ramponi, Alan  and
      Xu, Wei  and
      Ritter, Alan",
    booktitle = "Proceedings of the Tenth Workshop on Noisy and User-generated Text",
    month = may,
    year = "2025",
    address = "Albuquerque, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.wnut-1.3/",
    pages = "16--25",
    ISBN = "979-8-89176-232-9",
    abstract = "A formidable challenge regarding scraped corpora of social media is the omission of whitespaces, causing pairs of words to be conflated together as one. In order for the text to be properly parsed and analyzed, these missing spaces must be detected and restored. However, it is particularly hard to restore whitespace in languages such as Hebrew which are written without vowels, because a conflated form can often be split into multiple different pairs of valid words. Thus, a simple dictionary lookup is not feasible. In this paper, we present and evaluate a series of neural approaches to restore missing spaces in scraped Hebrew social media. Our best all-around method involved pretraining a new character-based BERT model for Hebrew, and then fine-tuning a space restoration model on top of this new BERT model. This method is blazing fast, high-performing, and open for unrestricted use, providing a practical solution to process huge Hebrew social media corpora with a consumer-grade GPU. We release the new BERT model and the fine-tuned space-restoration model to the NLP community."
}

License

Shield:

This work is licensed under a Creative Commons Attribution 4.0 International License.

Downloads last month: 1

Safetensors

Model size

88M params

Tensor type

F32

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for dicta-il/dictabert-char

Finetunes

1 model