Baivaria
Baivaria is an encoder-only language model for Bavarian that achieves new state-of-the-art (SOTA) results on Named Entity Recognition (NER) and Part-of-Speech (PoS) Tagging.
More detailed information about the model can be found in this GitHub repo.
📋 Changelog
- 18.09.2025: Initial version of this model.
Data Selection
We use the following Bavarian corpora for the pretraining of Baivaria:
- Bavarian Wikipedia
- Bavarian Bible
- Bavarian Awesome Tagesschau
- Bavarian Occiglot
- Bavarian Books
- Bavarian Finepdfs
The following table shows statistics for all corpora after filtering:
Corpus Name | Source / Quality | Documents | Sentences | Tokens | Plaintext Size |
---|---|---|---|---|---|
Bavarian Wikipedia | High-quality Wikipedia | 43,627 | 242,245 | 7,001,569 | 21M |
Bavarian Bible | Gemini-translated | 1,189 | 35,156 | 1,346,116 | 3.8M |
Bavarian Awesome Tagesschau | Gemini-translated | 10,036 | 335,989 | 10,528,908 | 35M |
Bavarian Occiglot | Gemini-translated | 149,774 | 6,842,935 | 214,697,892 | 834M |
Bavarian Books | OCR'ed Books | 4,361 | 53,656 | 1,147,435 | 3.2M |
Bavarian Finepdfs | OCR'ed PDFs | 1,989 | 73,970 | 2,381,873 | 6.7M |
Overall, the pretraining corpus has 210,976 documents, 7,583,951 sentences, and 237,103,793 tokens, with a total plaintext size of 903M.
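The per-corpus numbers above can be approximated with a small counting script along the following lines. This is only a sketch: the file names, the one-document-per-line layout, the whitespace tokenization, and the punctuation-based sentence splitting are assumptions, not the actual filtering pipeline.

```python
import re
from pathlib import Path

# Hypothetical layout: one plaintext file per corpus, one document per line.
CORPORA = {
    "bavarian-wikipedia": "data/bavarian_wikipedia.txt",
    "bavarian-bible": "data/bavarian_bible.txt",
    "bavarian-awesome-tagesschau": "data/bavarian_tagesschau.txt",
    "bavarian-occiglot": "data/bavarian_occiglot.txt",
    "bavarian-books": "data/bavarian_books.txt",
    "bavarian-finepdfs": "data/bavarian_finepdfs.txt",
}

def corpus_stats(path: str) -> tuple[int, int, int, int]:
    """Return (documents, sentences, tokens, plaintext bytes) for one corpus file."""
    documents = sentences = tokens = size = 0
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        documents += 1
        # Naive sentence splitting and whitespace tokenization,
        # only meant to approximate the reported statistics.
        sentences += len(re.findall(r"[.!?]+", line)) or 1
        tokens += len(line.split())
        size += len(line.encode("utf-8"))
    return documents, sentences, tokens, size

for name, path in CORPORA.items():
    docs, sents, toks, size = corpus_stats(path)
    print(f"{name}: {docs:,} docs, {sents:,} sentences, {toks:,} tokens, {size / 1e6:.1f}M")
```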
Pretraining
Pretraining a Bavarian model from scratch would be very inefficient, as there is not enough pretraining data.
For Baivaria, we follow the main idea of the "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks" paper by Gururangan et al. and perform domain-adaptive pretraining, continuing pretraining from a strong German encoder-only model.
We use the recently released GERTuraX-3 as the backbone and continue pretraining on our Bavarian corpus. Additionally, we perform a small hyper-parameter search and report the micro F1-score on the BarNER NER dataset. The best-performing ablation model is used as the final model and is released as Baivaria v1.
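The actual pretraining runs use the TensorFlow Model Garden on TPUs (see the acknowledgements). Purely to illustrate the domain-adaptive pretraining setup, a roughly equivalent continued MLM pretraining run from the GERTuraX-3 checkpoint could be sketched with Hugging Face Transformers as follows; the corpus file name, sequence length, and per-device batch size are placeholders, while learning rate, warmup, and step count correspond to Ablation 1 in the table below.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the strong German encoder and adapt it to Bavarian (DAPT).
model_name = "gerturax/gerturax-3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical plaintext dump of the Bavarian pretraining corpus.
dataset = load_dataset("text", data_files={"train": "bavarian_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="baivaria-dapt",
    per_device_train_batch_size=32,   # placeholder; the TPU runs use a global batch size of 1,024
    learning_rate=3e-4,               # init_lr of Ablation 1
    warmup_steps=266,                 # warmup_steps of Ablation 1
    max_steps=26_638,                 # train_steps of Ablation 1
    lr_scheduler_type="linear",
    save_steps=5_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```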
Thanks to the TRC program, the following ablation models could be pretrained on a v4-32 TPU Pod:
Hyper-Parameter | Ablation 1 | Ablation 2 | Ablation 3 | Ablation 4 |
---|---|---|---|---|
decay_steps | 26,638 | 26,638 | 26,638 | 26,638 |
end_lr | 0.0 | 0.0 | 0.0 | 0.0 |
init_lr | 0.0003 | 0.0003 | 0.0005 | 0.0003 |
train_steps | 26,638 | 26,638 | 26,638 | 26,638 |
global_batch_size | 1024 | 1024 | 1024 | 1024 |
warmup_steps | 266 | 0 | 1598 | 2663 |
Ablation 3 uses the hyper-parameters proposed in the "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks" paper.
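Read together, these hyper-parameters describe a linear warmup from 0 to init_lr followed by a decay to end_lr over decay_steps. A minimal sketch of this schedule, assuming a plain piecewise-linear shape (the exact decay curve used by the training framework may differ slightly):

```python
def learning_rate(step: int,
                  init_lr: float = 3e-4,
                  end_lr: float = 0.0,
                  warmup_steps: int = 266,
                  decay_steps: int = 26_638) -> float:
    """Piecewise-linear schedule: warmup from 0 to init_lr, then linear decay to end_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return init_lr * step / warmup_steps
    progress = min(step, decay_steps) / decay_steps
    return init_lr + (end_lr - init_lr) * progress

# Example: learning rate at a few points of the Ablation 1 schedule.
for step in (0, 266, 13_319, 26_638):
    print(step, f"{learning_rate(step):.6f}")
```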
We now report the MLM accuracy and training loss, as well as the downstream performance on the BarNER dataset. For fine-tuning, we use the last checkpoint and the hyper-parameters specified in the GERTuraX Fine-Tuner repo:
Metric | Ablation 1 | Ablation 2 | Ablation 3 | Ablation 4 |
---|---|---|---|---|
MLM Accuracy | 72.24 | 72.17 | 72.99 | 71.61 |
Train Loss | 2.9175 | 2.9248 | 2.8785 | 2.9689 |
BarNER F1-Score | 80.21 ± 0.31 | 80.83 ± 0.28 | 80.59 ± 0.35 | 80.06 ± 0.41 |
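The reported BarNER F1-score is the entity-level micro F1 over the predicted tag sequences. As a toy illustration of the metric itself, assuming the standard seqeval-style evaluation (the BIO tag sequences below are made up):

```python
from seqeval.metrics import f1_score

# Toy example: gold and predicted BIO tag sequences for two sentences.
y_true = [["B-LOC", "O", "O", "B-PER", "I-PER"], ["O", "B-ORG", "O"]]
y_pred = [["B-LOC", "O", "O", "B-PER", "O"],     ["O", "B-ORG", "O"]]

# seqeval computes the span-level (micro-averaged) F1-score.
print(f"{100 * f1_score(y_true, y_pred):.2f}")
```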
Results
Not many datasets exist for evaluating Bavarian on downstream tasks. We use the following ones:
- BarNER for NER
- MaiBaam for PoS Tagging
We use the GERTuraX Fine-Tuner repo and its hyper-parameters to fine-tune Baivaria for Bavarian NER and PoS Tagging.
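The fine-tuning itself is a standard token-classification setup; a rough illustration with Hugging Face Transformers is shown below. The label set and hyper-parameter values here are placeholders, and the authoritative configuration lives in the GERTuraX Fine-Tuner repo.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bavarian-nlp/baivaria-v1"
# Placeholder label set; the real labels come from BarNER (NER) or MaiBaam (PoS Tagging).
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="baivaria-barner",
    learning_rate=5e-5,               # placeholder, not the GERTuraX Fine-Tuner value
    num_train_epochs=10,              # placeholder
    per_device_train_batch_size=16,   # placeholder
)

# train_dataset / eval_dataset would be the tokenized, label-aligned BarNER or
# MaiBaam splits; their preparation is omitted in this sketch.
trainer = Trainer(model=model, args=args)
```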
Overall
In this section, we compare Baivaria to the current state-of-the-art results reported in the corresponding papers.
For NER:
Model | F1-Score (Final test dataset) |
---|---|
GBERT Large from BarNER | 72.17 ± 1.75 |
Baivaria v1 | 75.70 ± 0.97 |
For PoS Tagging:
Model | Accuracy (Final test dataset) | F1-Score (Final test dataset) |
---|---|---|
GBERT Large from MaiBaam | 80.29 | 62.45 |
Baivaria v1 | 90.28 ± 0.16 | 73.65 ± 0.91 |
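For completeness, the released checkpoint can be loaded like any other encoder-only model; a minimal loading sketch (the Bavarian example sentence is purely illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bavarian-nlp/baivaria-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Illustrative Bavarian sentence; we only inspect the encoder output shape here.
inputs = tokenizer("Minga is de Hauptstod vo Bayern.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```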
❤️ Acknowledgements
Baivaria is the outcome of working with TPUs from the awesome TRC program and the TensorFlow Model Garden library.
Many thanks for providing TPUs!
Made from Bavarian Oberland with ❤️ and 🥨.