# Modern-LiBERTa

<!-- Provide a quick summary of what the model is/does. -->

Modern-LiBERTa is a ModernBERT encoder model designed specifically for **Ukrainian**, with support for **long contexts up to 8,192 tokens**. It was introduced in the paper *On the Path to Make Ukrainian a High-Resource Language*, presented at the [UNLP](https://unlp.org.ua/) workshop @ [ACL 2025](https://2025.aclweb.org/).

The model is pre-trained on **Kobza**, a large-scale Ukrainian corpus of nearly 60 billion tokens. Modern-LiBERTa builds on the [ModernBERT](https://arxiv.org/abs/2412.13663) architecture and is the first Ukrainian language model to support long-context encoding efficiently.

The goal of this work is to **make Ukrainian a first-class citizen in multilingual and monolingual NLP**, enabling robust performance on complex tasks that require broader context and knowledge access.

All training code and tokenizer tools are available in the [Goader/ukr-lm](https://github.com/Goader/ukr-lm) repository.

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

<!-- Read the [paper](https://aclanthology.org/2024.unlp-1.14/) for more detailed task descriptions. -->

| Model | NER-UK (Micro F1) | WikiANN (Micro F1) | UD POS (Accuracy) | News (Macro F1) |
|:------------------------------------------------------------------------------------------------------------------------|:------------------------:|:------------------:|:------------------------------:|:----------------------------------------:|
| <tr><td colspan="5" style="text-align: center;"><strong>Base Models</strong></td></tr>
| [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 90.86 (0.81) | 92.27 (0.09) | 98.45 (0.07) | - |
| [roberta-base-wechsel-ukrainian](https://huggingface.co/benjamin/roberta-base-wechsel-ukrainian) | 90.81 (1.51) | 92.98 (0.12) | 98.57 (0.03) | - |
| [electra-base-ukrainian-cased-discriminator](https://huggingface.co/lang-uk/electra-base-ukrainian-cased-discriminator) | 90.43 (1.29) | 92.99 (0.11) | 98.59 (0.06) | - |
| <tr><td colspan="5" style="text-align: center;"><strong>Large Models</strong></td></tr>
| [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) | 90.16 (2.98) | 92.92 (0.19) | 98.71 (0.04) | 95.13 (0.49) |
| [roberta-large-wechsel-ukrainian](https://huggingface.co/benjamin/roberta-large-wechsel-ukrainian) | 91.24 (1.16) | 93.22 (0.17) | 98.74 (0.06) | __96.48 (0.09)__ |
| [liberta-large](https://huggingface.co/Goader/liberta-large) | 91.27 (1.22) | 92.50 (0.07) | 98.62 (0.08) | 95.44 (0.04) |
| [liberta-large-v2](https://huggingface.co/Goader/liberta-large-v2) | __91.73 (1.81)__ | 93.22 (0.14) | __98.79 (0.06)__ | 95.67 (0.12) |
| [modern-liberta-large](https://huggingface.co/Goader/modern-liberta-large) | 91.66 (0.57) | __93.37 (0.16)__ | __98.78 (0.07)__ | 96.37 (0.07) |

## Fine-Tuning Hyperparameters

| Hyperparameter | Value |
|:---------------|:-----:|
| Peak Learning Rate | 3e-5 |
| Warm-up Ratio | 0.05 |
| Learning Rate Decay | Linear |
| Batch Size | 16 |
| Epochs | 10 |
| Weight Decay | 0.05 |

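These settings plug directly into `transformers.TrainingArguments`. The snippet below is a minimal sketch rather than the exact training script used in the paper; the output directory and `num_labels` are illustrative placeholders for a token-classification task such as NER.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
)

# The tokenizer ships with custom code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("Goader/modern-liberta-large", trust_remote_code=True)

# num_labels is a placeholder -- set it to the size of your tag set.
model = AutoModelForTokenClassification.from_pretrained(
    "Goader/modern-liberta-large",
    num_labels=9,
)

# Fine-tuning hyperparameters from the table above.
training_args = TrainingArguments(
    output_dir="modern-liberta-finetuned",  # placeholder
    learning_rate=3e-5,                     # peak learning rate
    lr_scheduler_type="linear",             # linear learning rate decay
    warmup_ratio=0.05,
    per_device_train_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.05,
)

# Pass `model`, `training_args`, and your tokenized dataset to a `Trainer` to fine-tune.
```
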
## How to Get Started with the Model

Use the code below to get started with the model. Note that the repository contains custom tokenization code, so pass `trust_remote_code=True` when loading the tokenizer or pipeline.

Pipeline usage (the example sentence reads "Taras had four apples. Marichka gave him two more. He gave all `<mask>` apples to his mom.", so the expected answer is "шість", i.e. "six"):

```python
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", "Goader/modern-liberta-large", trust_remote_code=True)
>>> fill_mask("Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі <mask> яблук мамі.")
[{'score': 0.3426803946495056,
  'token': 8638,
  'token_str': 'шість',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі шість яблук мамі.'},
 {'score': 0.21772164106369019,
  'token': 24170,
  'token_str': 'решту',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі решту яблук мамі.'},
 {'score': 0.16074775159358978,
  'token': 9947,
  'token_str': 'вісім',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі вісім яблук мамі.'},
 {'score': 0.078955739736557,
  'token': 2036,
  'token_str': 'сім',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі сім яблук мамі.'},
 {'score': 0.028996430337429047,
  'token': 813,
  'token_str': '6',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі 6 яблук мамі.'}]
```

Extracting embeddings:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Goader/modern-liberta-large", trust_remote_code=True)
model = AutoModel.from_pretrained("Goader/modern-liberta-large")

encoded = tokenizer('Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі шість яблук мамі.', return_tensors='pt')
output = model(**encoded)
```

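The forward pass above returns per-token hidden states in `output.last_hidden_state`. If a single sentence vector is needed, one common option (not prescribed by this model card) is attention-masked mean pooling; a minimal sketch continuing from the snippet above:

```python
import torch

# output.last_hidden_state has shape (batch_size, sequence_length, hidden_size).
# Zero out padding positions, then average the remaining token embeddings.
mask = encoded["attention_mask"].unsqueeze(-1).type_as(output.last_hidden_state)
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(sentence_embedding.shape)  # (1, hidden_size)
```
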
<!-- ## Citation -->

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

<!-- ```
@inproceedings{haltiuk-smywinski-pohl-2024-liberta,
    title = "{L}i{BERT}a: Advancing {U}krainian Language Modeling through Pre-training from Scratch",
    author = "Haltiuk, Mykola  and
      Smywi{\'n}ski-Pohl, Aleksander",
    editor = "Romanyshyn, Mariana  and
      Romanyshyn, Nataliia  and
      Hlybovets, Andrii  and
      Ignatenko, Oleksii",
    booktitle = "Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.unlp-1.14",
    pages = "120--128",
    abstract = "Recent advancements in Natural Language Processing (NLP) have spurred remarkable progress in language modeling, predominantly benefiting English. While Ukrainian NLP has long grappled with significant challenges due to limited data and computational resources, recent years have seen a shift with the emergence of new corpora, marking a pivotal moment in addressing these obstacles. This paper introduces LiBERTa Large, the inaugural BERT Large model pre-trained entirely from scratch only on Ukrainian texts. Leveraging extensive multilingual text corpora, including a substantial Ukrainian subset, LiBERTa Large establishes a foundational resource for Ukrainian NLU tasks. Our model outperforms existing multilingual and monolingual models pre-trained from scratch for Ukrainian, demonstrating competitive performance against those relying on cross-lingual transfer from English. This achievement underscores our ability to achieve superior performance through pre-training from scratch with additional enhancements, obviating the need to rely on decisions made for English models to efficiently transfer weights. We establish LiBERTa Large as a robust baseline, paving the way for future advancements in Ukrainian language modeling.",
}
``` -->

## Licence

[CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

## Authors

Mykola Haltiuk, PhD Candidate @ AGH University of Krakow

Aleksander Smywiński-Pohl, PhD @ AGH University of Krakow