Hungarian ModernBERT-base
Model & tokenizer pretrained from scratch (MLM task) only on cleaned(removal of toxic and sexual content, automatic language correctness filtering) and semantically segmented Hungarian text.
Vocab: 52k
Context length of up to 8,192 tokens
Train dataset: ~1billion tokens