Baivaria

Baivaria is a 135M-parameter encoder-only language model for Bavarian that achieves new state-of-the-art (SOTA) results on Named Entity Recognition (NER) and Part-of-Speech (PoS) Tagging.

More detailed information about the model can be found in this GitHub repo.
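
For a quick sanity check, the model can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub as bavarian-nlp/baivaria-v1 and loads with the standard Auto classes:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Repo id of the released checkpoint on the Hugging Face Hub.
model_id = "bavarian-nlp/baivaria-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

# Encode a Bavarian example sentence and extract contextual token embeddings.
inputs = tokenizer("Servus, i bin a Sprachmodell fia Boarisch.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```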

📋 Changelog

  • 18.09.2025: Initial version of this model.

Data Selection

We use six Bavarian corpora for the pretraining of Baivaria. The following table shows some statistics for each corpus, after filtering:

| Corpus Name | Quality Measures | Documents | Sentences | Tokens | Plaintext Size |
|-------------|------------------|-----------|-----------|--------|----------------|
| Bavarian Wikipedia | High-quality Wikipedia | 43,627 | 242,245 | 7,001,569 | 21M |
| Bavarian Bible | Gemini-translated | 1,189 | 35,156 | 1,346,116 | 3.8M |
| Bavarian Awesome Tagesschau | Gemini-translated | 10,036 | 335,989 | 10,528,908 | 35M |
| Bavarian Occiglot | Gemini-translated | 149,774 | 6,842,935 | 214,697,892 | 834M |
| Bavarian Books | OCR'ed Books | 4,361 | 53,656 | 1,147,435 | 3.2M |
| Bavarian Finepdfs | OCR'ed PDFs | 1,989 | 73,970 | 2,381,873 | 6.7M |

Overall, the pretraining corpus has 210,976 documents, 7,583,951 sentences and 237,103,793 tokens, with a total plaintext size of 903M.
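
These totals are simply the column sums of the table above; a short sketch to reproduce them:

```python
# Per-corpus statistics (documents, sentences, tokens) copied from the table above.
corpora = {
    "Bavarian Wikipedia":          (43_627,  242_245,   7_001_569),
    "Bavarian Bible":              (1_189,   35_156,    1_346_116),
    "Bavarian Awesome Tagesschau": (10_036,  335_989,   10_528_908),
    "Bavarian Occiglot":           (149_774, 6_842_935, 214_697_892),
    "Bavarian Books":              (4_361,   53_656,    1_147_435),
    "Bavarian Finepdfs":           (1_989,   73_970,    2_381_873),
}

documents, sentences, tokens = (sum(column) for column in zip(*corpora.values()))
print(documents, sentences, tokens)  # 210976 7583951 237103793
```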

Pretraining

Pretraining a Bavarian model from scratch would be very inefficient, as there is not enough pretraining data available.

For Baivaria we follow the main idea of the "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks" paper by Gururangan et al. and perform domain-adaptive pretraining, continuing pretraining from a strong German encoder-only model.

We use the recently released GERTuraX-3 as the backbone and continue pretraining on our Bavarian corpus. Additionally, we perform a small hyper-parameter search and report the micro F1-score on the BarNER NER dataset. The best-performing ablation model is used as the final model and is released as Baivaria in version 1.

Thanks to the TRC program, the following ablation models could be pretrained on a v4-32 TPU Pod:

| Hyper-Parameter | Ablation 1 | Ablation 2 | Ablation 3 | Ablation 4 |
|-----------------|------------|------------|------------|------------|
| decay_steps | 26,638 | 26,638 | 26,638 | 26,638 |
| end_lr | 0.0 | 0.0 | 0.0 | 0.0 |
| init_lr | 0.0003 | 0.0003 | 0.0005 | 0.0003 |
| train_steps | 26,638 | 26,638 | 26,638 | 26,638 |
| global_batch_size | 1024 | 1024 | 1024 | 1024 |
| warmup_steps | 266 | 0 | 1598 | 2663 |

Ablation 3 uses the hyper-parameters proposed in the "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks" paper.
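
For illustration, the fields in the table map onto a warmup-then-decay learning-rate schedule. The sketch below assumes a linear warmup to init_lr followed by a linear decay to end_lr, which is one common reading of these fields; the schedule actually used is defined by the TensorFlow Model Garden pretraining config:

```python
def learning_rate(step: int,
                  init_lr: float = 0.0005,     # Ablation 3 values from the table above
                  end_lr: float = 0.0,
                  warmup_steps: int = 1598,
                  decay_steps: int = 26_638) -> float:
    """Hypothetical schedule: linear warmup from 0 to init_lr, then linear decay to end_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return init_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, decay_steps - warmup_steps))
    return init_lr + (end_lr - init_lr) * progress

print(learning_rate(0), learning_rate(1598), learning_rate(26_638))  # 0.0 0.0005 0.0
```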

We now report the MLM accuracy and training loss, as well as the downstream task performance on the BarNER dataset. For fine-tuning, we use the last checkpoint and the hyper-parameters specified in the GERTuraX Fine-Tuner repo:

| Metric | Ablation 1 | Ablation 2 | Ablation 3 | Ablation 4 |
|--------|------------|------------|------------|------------|
| MLM Accuracy | 72.24 | 72.17 | 72.99 | 71.61 |
| Train Loss | 2.9175 | 2.9248 | 2.8785 | 2.9689 |
| BarNER F1-Score | 80.21 ± 0.31 | 80.83 ± 0.28 | 80.59 ± 0.35 | 80.06 ± 0.41 |
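
The ± values presumably denote mean and standard deviation over several fine-tuning runs with different random seeds; under that assumption they can be reproduced like this (the scores below are placeholders, not actual run results):

```python
import statistics

# Placeholder F1 scores from five hypothetical fine-tuning runs with different seeds.
f1_scores = [80.5, 80.9, 81.1, 80.7, 81.0]

mean = statistics.mean(f1_scores)
std = statistics.stdev(f1_scores)
print(f"{mean:.2f} ± {std:.2f}")
```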

Results

Not many Bavarian datasets exist for evaluation on downstream tasks. We use BarNER for NER and MaiBaam for PoS Tagging.

We use the GERTuraX Fine-Tuner repo and its hyper-parameters to fine-tune Baivaria for Bavarian NER and PoS Tagging.
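
A minimal sketch of such a token-classification fine-tuning run with the transformers Trainer is shown below. The dataset id and the training hyper-parameters are placeholders for illustration; the values actually used are those from the GERTuraX Fine-Tuner repo:

```python
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

model_id = "bavarian-nlp/baivaria-v1"
# Placeholder dataset id; a dataset with "tokens" and "ner_tags" columns is assumed.
dataset = load_dataset("...")

label_names = dataset["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=len(label_names))

def tokenize_and_align(batch):
    # Re-tokenize the pre-split words and label only the first sub-token of each word.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous_word_id = None
        labels = []
        for word_id in enc.word_ids(batch_index=i):
            if word_id is None or word_id == previous_word_id:
                labels.append(-100)  # ignored by the cross-entropy loss
            else:
                labels.append(tags[word_id])
            previous_word_id = word_id
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

# Placeholder hyper-parameters for illustration only.
args = TrainingArguments(output_dir="baivaria-ner", learning_rate=5e-5,
                         per_device_train_batch_size=16, num_train_epochs=10)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["validation"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```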

Overall

In this section we compare the results of Baivaria to the current state-of-the-art results reported in the corresponding papers.

For NER:

| Model | F1-Score (Final test dataset) |
|-------|-------------------------------|
| GBERT Large from BarNER | 72.17 ± 1.75 |
| Baivaria v1 | 75.70 ± 0.97 |

For PoS Tagging:

| Model | Accuracy (Final test dataset) | F1-Score (Final test dataset) |
|-------|-------------------------------|-------------------------------|
| GBERT Large from MaiBaam | 80.29 | 62.45 |
| Baivaria v1 | 90.28 ± 0.16 | 73.65 ± 0.91 |

❤️ Acknowledgements

Baivaria is the outcome of working with TPUs from the awesome TRC program and the TensorFlow Model Garden library.

Many thanks for providing TPUs!

Made from Bavarian Oberland with ❤️ and 🥨.
