Baivaria
Baivaria is an encoder-only language model for Bavarian that achieves new state-of-the-art (SOTA) results on Named Entity Recognition (NER) and Part-of-Speech (PoS) Tagging.
More detailed information about the model can be found in this GitHub repo.
📋 Changelog
- 18.09.2025: Initial version of this model.
Data Selection
We use the following Bavarian corpora for the pretraining of Baivaria:
- Bavarian Wikipedia
- Bavarian Bible
- Bavarian Awesome Tagesschau
- Bavarian Occiglot
- Bavarian Books
- Bavarian Finepdfs
The following table shows statistics for all corpora after filtering:
Corpus Name | Source / Quality | Documents | Sentences | Tokens | Plaintext Size |
---|---|---|---|---|---|
Bavarian Wikipedia | High-quality Wikipedia | 43,627 | 242,245 | 7,001,569 | 21M |
Bavarian Bible | Gemini-translated | 1,189 | 35,156 | 1,346,116 | 3.8M |
Bavarian Awesome Tagesschau | Gemini-translated | 10,036 | 335,989 | 10,528,908 | 35M |
Bavarian Occiglot | Gemini-translated | 149,774 | 6,842,935 | 214,697,892 | 834M |
Bavarian Books | OCR'ed Books | 4,361 | 53,656 | 1,147,435 | 3.2M |
Bavarian Finepdfs | OCR'ed PDFs | 1,989 | 73,970 | 2,381,873 | 6.7M |
Overall, the pretraining corpus has 210,976 documents, 7,583,951 sentences, and 237,103,793 tokens, with a total plaintext size of 903M.
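The per-corpus numbers above can be approximated with a small counting script along the following lines. This is only a sketch: the file names, the one-document-per-line layout, the whitespace tokenization, and the punctuation-based sentence splitting are assumptions, not the actual filtering pipeline.

```python
import re
from pathlib import Path

# Hypothetical layout: one plaintext file per corpus, one document per line.
CORPORA = {
    "bavarian-wikipedia": "data/bavarian_wikipedia.txt",
    "bavarian-bible": "data/bavarian_bible.txt",
    "bavarian-awesome-tagesschau": "data/bavarian_tagesschau.txt",
    "bavarian-occiglot": "data/bavarian_occiglot.txt",
    "bavarian-books": "data/bavarian_books.txt",
    "bavarian-finepdfs": "data/bavarian_finepdfs.txt",
}

def corpus_stats(path: str) -> tuple[int, int, int, int]:
    """Return (documents, sentences, tokens, plaintext bytes) for one corpus file."""
    documents = sentences = tokens = size = 0
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        documents += 1
        # Naive sentence splitting and whitespace tokenization,
        # only meant to approximate the reported statistics.
        sentences += len(re.findall(r"[.!?]+", line)) or 1
        tokens += len(line.split())
        size += len(line.encode("utf-8"))
    return documents, sentences, tokens, size

for name, path in CORPORA.items():
    docs, sents, toks, size = corpus_stats(path)
    print(f"{name}: {docs:,} docs, {sents:,} sentences, {toks:,} tokens, {size / 1e6:.1f}M")
```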
Pretraining
Pretraining a Bavarian model from scratch would be very inefficient, as there is not enough pretraining data.
For Baivaria, we follow the main idea of the "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks" paper by Gururangan et al. and perform domain-adaptive pretraining, continuing pretraining from a strong German encoder-only model.
We use the recently released GERTuraX-3 as the backbone and continue pretraining on our Bavarian corpus. Additionally, we perform a small hyper-parameter search and report the micro F1-score on the BarNER NER dataset. The best-performing ablation model is used as the final model and is released as Baivaria v1.
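The actual pretraining runs use the TensorFlow Model Garden on TPUs (see the acknowledgements). Purely to illustrate the domain-adaptive pretraining setup, a roughly equivalent continued MLM pretraining run from the GERTuraX-3 checkpoint could be sketched with Hugging Face Transformers as follows; the corpus file name, sequence length, and per-device batch size are placeholders, while learning rate, warmup, and step count correspond to Ablation 1 in the table below.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the strong German encoder and adapt it to Bavarian (DAPT).
model_name = "gerturax/gerturax-3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical plaintext dump of the Bavarian pretraining corpus.
dataset = load_dataset("text", data_files={"train": "bavarian_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="baivaria-dapt",
    per_device_train_batch_size=32,   # placeholder; the TPU runs use a global batch size of 1,024
    learning_rate=3e-4,               # init_lr of Ablation 1
    warmup_steps=266,                 # warmup_steps of Ablation 1
    max_steps=26_638,                 # train_steps of Ablation 1
    lr_scheduler_type="linear",
    save_steps=5_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```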
Thanks to the TRC program, the following ablation models could be pretrained on a v4-32 TPU Pod:
Hyper-Parameter | Ablation 1 | Ablation 2 | Ablation 3 | Ablation 4 |
---|---|---|---|---|
decay_steps | 26,638 | 26,638 | 26,638 | 26,638 |
end_lr | 0.0 | 0.0 | 0.0 | 0.0 |
init_lr | 0.0003 | 0.0003 | 0.0005 | 0.0003 |
train_steps | 26,638 | 26,638 | 26,638 | 26,638 |
global_batch_size | 1024 | 1024 | 1024 | 1024 |
warmup_steps | 266 | 0 | 1598 | 2663 |
Ablation 3 uses the hyper-parameters proposed in the "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks" paper.
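Read together, these hyper-parameters describe a linear warmup from 0 to init_lr followed by a decay to end_lr over decay_steps. A minimal sketch of this schedule, assuming a plain piecewise-linear shape (the exact decay curve used by the training framework may differ slightly):

```python
def learning_rate(step: int,
                  init_lr: float = 3e-4,
                  end_lr: float = 0.0,
                  warmup_steps: int = 266,
                  decay_steps: int = 26_638) -> float:
    """Piecewise-linear schedule: warmup from 0 to init_lr, then linear decay to end_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return init_lr * step / warmup_steps
    progress = min(step, decay_steps) / decay_steps
    return init_lr + (end_lr - init_lr) * progress

# Example: learning rate at a few points of the Ablation 1 schedule.
for step in (0, 266, 13_319, 26_638):
    print(step, f"{learning_rate(step):.6f}")
```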
We now report the MLM accuracy and training loss, as well as the downstream performance on the BarNER dataset. For fine-tuning, we use the last checkpoint and the hyper-parameters specified in the GERTuraX Fine-Tuner repo:
Metric | Ablation 1 | Ablation 2 | Ablation 3 | Ablation 4 |
---|---|---|---|---|
MLM Accuracy | 72.24 | 72.17 | 72.99 | 71.61 |
Train Loss | 2.9175 | 2.9248 | 2.8785 | 2.9689 |
BarNER F1-Score | 80.21 ± 0.31 | 80.83 ± 0.28 | 80.59 ± 0.35 | 80.06 ± 0.41 |
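The reported BarNER F1-score is the entity-level micro F1 over the predicted tag sequences. As a toy illustration of the metric itself, assuming the standard seqeval-style evaluation (the BIO tag sequences below are made up):

```python
from seqeval.metrics import f1_score

# Toy example: gold and predicted BIO tag sequences for two sentences.
y_true = [["B-LOC", "O", "O", "B-PER", "I-PER"], ["O", "B-ORG", "O"]]
y_pred = [["B-LOC", "O", "O", "B-PER", "O"],     ["O", "B-ORG", "O"]]

# seqeval computes the span-level (micro-averaged) F1-score.
print(f"{100 * f1_score(y_true, y_pred):.2f}")
```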
Results
Not many datasets exist for evaluating Bavarian on downstream tasks. We use the following ones:
- BarNER for NER
- MaiBaam for PoS Tagging
We use the GERTuraX Fine-Tuner repo and its hyper-parameters to fine-tune Baivaria for Bavarian NER and PoS Tagging.
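The fine-tuning itself is a standard token-classification setup; a rough illustration with Hugging Face Transformers is shown below. The label set and hyper-parameter values here are placeholders, and the authoritative configuration lives in the GERTuraX Fine-Tuner repo.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bavarian-nlp/baivaria-v1"
# Placeholder label set; the real labels come from BarNER (NER) or MaiBaam (PoS Tagging).
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="baivaria-barner",
    learning_rate=5e-5,               # placeholder, not the GERTuraX Fine-Tuner value
    num_train_epochs=10,              # placeholder
    per_device_train_batch_size=16,   # placeholder
)

# train_dataset / eval_dataset would be the tokenized, label-aligned BarNER or
# MaiBaam splits; their preparation is omitted in this sketch.
trainer = Trainer(model=model, args=args)
```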
Overall
In this section, we compare Baivaria to the current state-of-the-art results reported in the corresponding papers.
For NER:
Model | F1-Score (Final test dataset) |
---|---|
GBERT Large from BarNER | 72.17 ± 1.75 |
Baivaria v1 | 75.70 ± 0.97 |
For PoS Tagging:
Model | Accuracy (Final test dataset) | F1-Score (Final test dataset) |
---|---|---|
GBERT Large from MaiBaam | 80.29 | 62.45 |
Baivaria v1 | 90.28 ± 0.16 | 73.65 ± 0.91 |
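For completeness, the released checkpoint can be loaded like any other encoder-only model; a minimal loading sketch (the Bavarian example sentence is purely illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bavarian-nlp/baivaria-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Illustrative Bavarian sentence; we only inspect the encoder output shape here.
inputs = tokenizer("Minga is de Hauptstod vo Bayern.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```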
❤️ Acknowledgements
Baivaria is the outcome of working with TPUs from the awesome TRC program and the TensorFlow Model Garden library.
Many thanks for providing TPUs!
Made from Bavarian Oberland with ❤️ and 🥨.