mschonhardt committed
Commit c0c25ef · verified · 1 Parent(s): 385ec61

Update README.md

Files changed (1): README.md (+133 -42)
README.md CHANGED
@@ -1,66 +1,157 @@
  ---
  library_name: peft
- license: apache-2.0
  base_model: google/byt5-base
  tags:
- - base_model:adapter:google/byt5-base
  - lora
- - transformers
  model-index:
- - name: byt5-base-bdd-expansion-lora-v4-l40s
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # byt5-base-bdd-expansion-lora-v4-l40s

- This model is a fine-tuned version of [google/byt5-base](https://huggingface.co/google/byt5-base) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0025

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 3e-05
- - train_batch_size: 32
- - eval_batch_size: 32
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 5

- ### Training results

- | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:-----:|:-----:|:---------------:|
- | 0.0227 | 1.0 | 7528 | 0.0103 |
- | 0.0102 | 2.0 | 15056 | 0.0046 |
- | 0.0062 | 3.0 | 22584 | 0.0031 |
- | 0.0055 | 4.0 | 30112 | 0.0026 |
- | 0.0054 | 5.0 | 37640 | 0.0025 |

- ### Framework versions

- - PEFT 0.16.0
- - Transformers 4.53.2
- - Pytorch 2.7.1+cu128
- - Datasets 4.0.0
- - Tokenizers 0.21.2
  ---
+ language: la
+ license: cc-by-sa-4.0
  library_name: peft
  base_model: google/byt5-base
  tags:
+ - text2text-generation
+ - byt5
  - lora
+ - medieval-latin
+ - abbreviation-expansion
+ - digital-humanities
+ datasets:
+ - mschonhardt/bdd-abbreviations-augmented
  model-index:
+ - name: byt5-base-bdd-expansion-lora
+   results:
+   - task:
+       type: text2text-generation
+       name: Abbreviation Expansion
+     dataset:
+       name: mschonhardt/bdd-abbreviations-augmented
+       type: mschonhardt/bdd-abbreviations-augmented
+       config: default
+       split: test
+     metrics:
+     - type: loss
+       value: 0.0025
+       name: eval_loss
  ---

+ # Model Card for byt5-base-burchard-expansion

+ This model card describes a fine-tuned version of `google/byt5-base`, adapted for the specific task of expanding abbreviations in 11th-century Latin manuscripts from the Burchards Dekret Digital (BDD) project.

+ - **Model type:** Byte-level sequence-to-sequence (ByT5)
+ - **Fine-tuning method:** Low-Rank Adaptation (LoRA) with 8-bit quantization
+ - **Base model:** [`google/byt5-base`](https://huggingface.co/google/byt5-base)
+ - **Language:** Medieval Latin (`la`)
+ - **Training dataset:** [`mschonhardt/bdd-abbreviations-augmented`](https://huggingface.co/datasets/mschonhardt/bdd-abbreviations-augmented)
+ - **Training scripts:** [Zenodo](https://doi.org/10.5281/zenodo.16628612), [GitHub](https://github.com/michaelscho/Abbreviationes)
+ - **Contact:** Michael Schonhardt ([email protected], [ORCID](https://orcid.org/0000-0002-2750-1900))
+ - **Burchards Dekret Digital (BDD):** [Website](https://www.burchards-dekret-digital.de/)
+ - **Zenodo:** [https://doi.org/10.5281/zenodo.16736386](https://doi.org/10.5281/zenodo.16736386)

+ ## Model Description

+ This repository contains the **LoRA adapters** for a ByT5-base model. It is not a standalone model but a set of trained weights that can be efficiently loaded on top of the original `google/byt5-base` to specialize it for a single task: expanding scribal abbreviations found in the manuscripts of Burchard's Decree.

+ The ByT5 architecture was chosen because it operates directly on UTF-8 bytes, making it exceptionally robust for paleographic tasks. It requires no custom tokenizer and can handle the rich set of special Unicode characters (MUFI) and orthographic variations present in medieval texts without encountering "unknown token" issues.

+ The model was fine-tuned using 8-bit quantization and PEFT (LoRA), which significantly reduces the computational resources required for training and inference while maintaining high performance.
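+
+ As a quick illustration of the byte-level behaviour described above, the stock ByT5 tokenizer encodes MUFI glyphs directly (a minimal sketch; the example string is illustrative and not taken from the training data):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # ByT5 works on raw UTF-8 bytes, so special glyphs such as "ꝑ" never map to an unknown token.
+ tok = AutoTokenizer.from_pretrained("google/byt5-base")
+ ids = tok("semꝑ").input_ids  # one id per UTF-8 byte, plus the trailing </s> token
+ print(ids)
+ ```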
+
+ ## Intended Use
+
+ The primary use of this model is to automate the expansion of abbreviations in texts transcribed from the five key manuscripts of the *Decretum Burchardi*. It serves as a key component in a digital editing workflow, supporting the creation of TEI-XML critical editions.
+
+ ### How to Use
+
+ First, install the necessary libraries:
+
+ ```bash
+ pip install transformers torch accelerate peft bitsandbytes
+ ```
+
+ The model can then be loaded by combining the base model (`google/byt5-base`) with the adapters from this repository:
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+ from peft import PeftModel
+
+ # Model identifiers
+ base_model_id = "google/byt5-base"
+ adapter_model_id = "mschonhardt/byt5-base-bdd-expansion-lora-v4-l40s"
+
+ # Load the base tokenizer and model (with 8-bit quantization)
+ tokenizer = AutoTokenizer.from_pretrained(base_model_id)
+ base_model = AutoModelForSeq2SeqLM.from_pretrained(
+     base_model_id,
+     load_in_8bit=True,
+     device_map="auto",
+ )
+
+ # Load the LoRA adapters onto the base model
+ model = PeftModel.from_pretrained(base_model, adapter_model_id)
+ model.eval()
+
+ # Prepare the input text; note the task prefix used during training
+ prefix = "expand abbreviations: "
+ abbreviated_text = "om̅s posteri eorū cuncta sibi uendicarent sed semꝑ maiores causę sicut s̅ ep̅oꝝ..."
+ input_text = prefix + abbreviated_text
+
+ # Tokenize and generate
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
+ outputs = model.generate(input_ids, max_length=1024)
+ expanded_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ print(f"Abbreviated: {abbreviated_text}")
+ print(f"Expanded: {expanded_text}")
+ # Expected output: omnes posteri eorum cuncta sibi uendicarent sed semper maiores causę sicut sunt episcoporum...
+ ```
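+
+ If a standalone checkpoint is preferred over loading adapters at runtime, the LoRA weights can be merged into a full-precision copy of the base model (a sketch, not part of the original workflow; merging is done here without 8-bit quantization, and the output directory name is illustrative):
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ from peft import PeftModel
+
+ # Load the base model in full precision, attach the adapters, and merge them into the base weights.
+ base = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-base")
+ merged = PeftModel.from_pretrained(base, "mschonhardt/byt5-base-bdd-expansion-lora-v4-l40s").merge_and_unload()
+
+ # Save the merged model and tokenizer as a regular Transformers checkpoint.
+ merged.save_pretrained("byt5-base-bdd-expansion-merged")
+ AutoTokenizer.from_pretrained("google/byt5-base").save_pretrained("byt5-base-bdd-expansion-merged")
+ ```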
+
+ ### Training and Evaluation
+
+ #### Training Data
+
+ The model was fine-tuned on the [`mschonhardt/bdd-abbreviations-augmented`](https://huggingface.co/datasets/mschonhardt/bdd-abbreviations-augmented) dataset. This dataset consists of parallel text lines extracted from the five principal manuscripts of the *Decretum Burchardi*. Each entry contains an abbreviated `source_text` and a manually verified, fully expanded `target_text`. Rare abbreviations were automatically duplicated so that they are better represented in the final model.
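+
+ A quick way to inspect the parallel data is via the `datasets` library (a minimal sketch; the `train` split name is an assumption, and the `source_text`/`target_text` field names follow the description above):
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset("mschonhardt/bdd-abbreviations-augmented")
+ example = ds["train"][0]
+ print(example["source_text"])   # abbreviated manuscript line
+ print(example["target_text"])   # manually verified expansion
+
+ # Inputs are given the same task prefix that is used at inference time.
+ model_input = "expand abbreviations: " + example["source_text"]
+ ```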
+
+ #### Training Procedure
+
+ The model was trained with the provided training scripts ([Zenodo](https://doi.org/10.5281/zenodo.16628612), [GitHub](https://github.com/michaelscho/Abbreviationes)), which build on the Hugging Face `transformers` and `peft` libraries. The model was evaluated at the end of each epoch on a held-out test split (10% of the data), and the published adapters correspond to the checkpoint with the lowest evaluation loss (0.0025).
+
+ #### Training Hyperparameters
+
+ - **learning_rate**: 3e-05
+ - **train_batch_size**: 32
+ - **eval_batch_size**: 32
+ - **seed**: 42
+ - **optimizer**: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
+ - **lr_scheduler_type**: cosine
+ - **lr_scheduler_warmup_ratio**: 0.1
+ - **num_epochs**: 5
+ - **PEFT method**: LoRA (see the configuration sketch after this list)
+ - **r**: 32
+ - **lora_alpha**: 64
+ - **lora_dropout**: 0.05
+ - **target_modules**: ["q", "k", "v", "o"]
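+
+ For reference, the values above correspond to a PEFT/Transformers configuration along these lines (a sketch reconstructed from the listed hyperparameters, not a copy of the actual training script; whether the batch sizes are per device is an assumption):
+
+ ```python
+ from peft import LoraConfig, TaskType
+ from transformers import Seq2SeqTrainingArguments
+
+ # LoRA adapter configuration mirroring the values listed above.
+ lora_config = LoraConfig(
+     task_type=TaskType.SEQ_2_SEQ_LM,
+     r=32,
+     lora_alpha=64,
+     lora_dropout=0.05,
+     target_modules=["q", "k", "v", "o"],
+ )
+
+ # Optimisation settings mirroring the values listed above, with per-epoch
+ # evaluation and best-checkpoint selection as described in the procedure.
+ training_args = Seq2SeqTrainingArguments(
+     output_dir="byt5-base-bdd-expansion-lora",
+     learning_rate=3e-5,
+     per_device_train_batch_size=32,
+     per_device_eval_batch_size=32,
+     seed=42,
+     optim="adamw_torch",
+     lr_scheduler_type="cosine",
+     warmup_ratio=0.1,
+     num_train_epochs=5,
+     eval_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,
+     metric_for_best_model="eval_loss",
+ )
+ ```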
+
+ #### Framework Versions
+
+ - PEFT: 0.16.0
+ - Transformers: 4.53.2
+ - PyTorch: 2.7.1+cu128
+ - Datasets: 4.0.0
+ - Tokenizers: 0.21.2
+
+ ### Limitations and Bias
+
+ - **High specificity**: The model is highly specialized. It is trained on the scribal conventions of a single scriptorium (Worms) from a specific period (early 11th century). Its performance will likely degrade significantly on manuscripts from other regions or time periods without further fine-tuning.
+ - **Augmented data**: Because the dataset was augmented to better represent rare brevigraphs, the model may underperform on material whose distribution of brevigraphs differs significantly from the training data.
+ - **Fixed abbreviation set**: The model can only expand abbreviations that were present in its training data. It cannot generalize to unseen brevigraphs.
+ - **Context dependence**: While the 3-line window used for training provides local context, the model may still struggle with highly ambiguous abbreviations where broader semantic understanding is required.
+
+ ### Citation
+
+ If you use this model in your research, please cite it appropriately, for example:
+
+ ```bibtex
+ @misc{schonhardt_byt5_burchard_2025,
+   author       = {Schonhardt, Michael},
+   title        = {ByT5-base-burchard-expansion: A LoRA-finetuned model for Medieval Latin Abbreviation Expansion},
+   year         = {2025},
+   institution  = {Burchards Dekret Digital},
+   doi          = {10.5281/zenodo.16736386},
+   howpublished = {\url{https://huggingface.co/mschonhardt/byt5-base-bdd-expansion-lora-v4-l40s}}
+ }
+ ```