# Model Card for byt5-base-burchard-expansion
This model card describes a fine-tuned version of google/byt5-base, adapted for the specific task of expanding abbreviations in 11th-century Latin manuscripts from the Burchards Dekret Digital (BDD) project.

- Model type: Byte-level sequence-to-sequence (ByT5)
- Fine-tuning method: Low-Rank Adaptation (LoRA) with 8-bit quantization
- Base model: google/byt5-base
- Language: Medieval Latin (la)
- Training dataset: mschonhardt/bdd-abbreviations-augmented
- Training scripts: Zenodo, GitHub
- Contact: Michael Schonhardt ([email protected], ORCID)
- Burchards Dekret Digital (BDD): Website
- Zenodo: Zenodo
## Model Description

This repository contains the LoRA adapters for a ByT5-base model. It is not a standalone model but a set of trained weights that can be efficiently loaded on top of the original google/byt5-base to specialize it for a single task: expanding scribal abbreviations found in the manuscripts of Burchard's Decree.
The ByT5 architecture was chosen because it operates directly on UTF-8 bytes, making it exceptionally robust for paleographic tasks. It requires no custom tokenizer and can handle the rich set of special Unicode characters (MUFI) and orthographic variations present in medieval texts without encountering "unknown token" issues.
The model was fine-tuned using 8-bit quantization and PEFT (LoRA), which significantly reduces the computational resources required for training and inference while maintaining high performance.
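To make the byte-level behavior concrete, the short snippet below (illustrative only, not part of the training or inference code) shows how the stock ByT5 tokenizer decomposes a line containing combining marks and MUFI glyphs into plain UTF-8 byte IDs and recovers it without loss:

```python
from transformers import AutoTokenizer

# The stock ByT5 tokenizer needs no custom vocabulary: every character,
# including MUFI glyphs and combining overlines, is split into UTF-8 bytes.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")

sample = "ep̅oꝝ"  # abbreviated form of "episcoporum" with a combining overline and a MUFI glyph
ids = tokenizer(sample).input_ids

print(ids)  # one ID per UTF-8 byte (plus the trailing </s> token), never <unk>
print(tokenizer.decode(ids, skip_special_tokens=True))  # round-trips back to the original string
```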
## Intended Use

The primary use of this model is to automate the expansion of abbreviations in texts transcribed from the five key manuscripts of the Decretum Burchardi. It serves as a key component in a digital editing workflow, supporting the creation of TEI-XML critical editions.
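As a purely illustrative sketch of how model output might feed such a workflow (the BDD project's actual TEI encoding conventions are not described in this card), an abbreviated reading and its model-generated expansion can be paired in a standard TEI `<choice>` element:

```python
from xml.sax.saxutils import escape

def tei_choice(abbr: str, expan: str) -> str:
    """Wrap an abbreviated reading and its expansion in a TEI <choice> element."""
    return f"<choice><abbr>{escape(abbr)}</abbr><expan>{escape(expan)}</expan></choice>"

print(tei_choice("ep̅oꝝ", "episcoporum"))
# <choice><abbr>ep̅oꝝ</abbr><expan>episcoporum</expan></choice>
```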
## How to Use

First, install the necessary libraries:

```bash
pip install transformers torch accelerate peft bitsandbytes
```
The model can be loaded with the base model (google/byt5-base) and the adapters from this repository.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import PeftModel

# Model identifiers
base_model_id = "google/byt5-base"
adapter_model_id = "mschonhardt/byt5-base-bdd-expansion-lora-v4-l40s"

# Load the base tokenizer and model (with 8-bit quantization)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForSeq2SeqLM.from_pretrained(
    base_model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Load the LoRA adapters onto the base model
model = PeftModel.from_pretrained(base_model, adapter_model_id)
model.eval()

# Prepare the input text
# Note the prefix used during training
prefix = "expand abbreviations: "
abbreviated_text = "om̅s posteri eorū cuncta sibi uendicarent sed semꝑ maiores causę sicut s̅ ep̅oꝝ..."
input_text = prefix + abbreviated_text

# Tokenize and generate
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids, max_length=1024)
expanded_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Abbreviated: {abbreviated_text}")
print(f"Expanded: {expanded_text}")
# Expected output: omnes posteri eorum cuncta sibi uendicarent sed semper maiores causę sicut sunt episcoporum...
```
## Training and Evaluation

### Training Data

The model was fine-tuned on the mschonhardt/bdd-abbreviations-augmented dataset (https://huggingface.co/datasets/mschonhardt/bdd-abbreviations-augmented). This dataset consists of parallel text lines extracted from the five principal manuscripts of the Decretum Burchardi. Each entry contains an abbreviated source_text and a manually verified, fully expanded target_text. Lines containing rare abbreviations were automatically oversampled so that they are better represented in the final model.
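As a minimal sketch of working with these pairs (assuming the column names described above; the authoritative preprocessing lives in the linked training scripts), the dataset can be loaded and given the task prefix used at inference time:

```python
from datasets import load_dataset

# Load the parallel lines; each row pairs an abbreviated source_text
# with its manually verified expansion in target_text.
dataset = load_dataset("mschonhardt/bdd-abbreviations-augmented")

def add_task_prefix(example):
    # Same prefix as in the "How to Use" section above.
    return {
        "input_text": "expand abbreviations: " + example["source_text"],
        "target_text": example["target_text"],
    }

dataset = dataset.map(add_task_prefix)
print(dataset)
```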
### Training Procedure

The model was trained using the provided training script (Zenodo, GitHub), which relies on the Hugging Face transformers and peft libraries; a configuration sketch based on the hyperparameters below follows the list. The model was evaluated at the end of each epoch on a held-out test split (10%) of the training data. The final model is the checkpoint with the lowest evaluation loss (0.0025).
### Training Hyperparameters
- learning_rate: 3e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 5
- PEFT Method: LoRA
- r: 32
- lora_alpha: 64
- lora_dropout: 0.05
- target_modules: ["q", "k", "v", "o"]
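The following sketch shows how these settings would translate into a peft LoraConfig and transformers Seq2SeqTrainingArguments (an illustration based on the values listed above, not the exact training script; output_dir is a hypothetical path):

```python
from transformers import Seq2SeqTrainingArguments
from peft import LoraConfig, TaskType

# LoRA configuration matching the PEFT settings listed above.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q", "k", "v", "o"],  # attention projections in the ByT5/T5 blocks
)

# Training arguments matching the listed hyperparameters; AdamW with
# betas=(0.9, 0.999) and epsilon=1e-08 is the transformers default optimizer.
training_args = Seq2SeqTrainingArguments(
    output_dir="byt5-bdd-expansion",  # hypothetical output path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the checkpoint with the lowest eval loss
    metric_for_best_model="eval_loss",
)
```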
### Framework Versions
- PEFT: 0.16.0
- Transformers: 4.53.2
- Pytorch: 2.7.1+cu128
- Datasets: 4.0.0
- Tokenizers: 0.21.2
## Limitations and Bias
- High Specificity: This model is highly specialized. It is trained on the scribal conventions of a single scriptorium (Worms) from a specific period (early 11th century). Its performance will likely degrade significantly on manuscripts from other regions or time periods without further fine-tuning.
- Augmented Data: As the dataset was augmented to better represent rare brevigraphs, the trained model might fail in instances where the distribution of brevigraphs differs significantly.
- Fixed Abbreviation Set: The model can only expand abbreviations that were present in its training data. It cannot generalize to unseen brevigraphs.
- Context-Dependent: While the 3-line window used for training provides local context, the model may still struggle with highly ambiguous abbreviations where broader semantic understanding is required.
## Citation

If you use this model in your research, please cite it appropriately:

```bibtex
@misc{schonhardt_byt5_burchard_2025,
  author       = {Schonhardt, Michael},
  title        = {ByT5-base-burchard-expansion: A LoRA-finetuned model for Medieval Latin Abbreviation Expansion},
  year         = {2025},
  institution  = {Burchards Dekret Digital},
  doi          = {10.5281/zenodo.16736386},
  howpublished = {\url{https://huggingface.co/mschonhardt/byt5-base-bdd-expansion-lora-v4-l40s}}
}
```