File size: 4,634 Bytes
235ff9e
 
 
 
 
 
 
 
 
 
 
 
 
c536a4a
235ff9e
 
 
 
 
 
c536a4a
235ff9e
 
 
 
 
 
 
 
 
 
 
 
 
3707c8d
235ff9e
 
 
 
 
 
 
d77bfad
 
235ff9e
 
 
 
 
 
 
3707c8d
235ff9e
 
 
 
 
 
 
 
 
 
d77bfad
235ff9e
 
 
 
 
d77bfad
235ff9e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3b35bb1
235ff9e
 
 
3b35bb1
 
 
 
 
 
 
 
235ff9e
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
---
language:
- ga
inference: false
tags:
- BERT
- HPLT
- encoder
license: apache-2.0
datasets:
- HPLT/HPLT2.0_cleaned
---

# HPLT v2.0 BERT for Irish

<img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%>

This is one of the encoder-only monolingual language models trained as a second release by the [HPLT project](https://hplt-project.org/).
It is a so called masked language model. In particular, we used the modification of the classic BERT model named [LTG-BERT](https://aclanthology.org/2023.findings-eacl.146/).

We present monolingual LTG-BERT models for more than 50 languages out of 191 total in the [HPLT v2.0 dataset](https://hplt-project.org/datasets/v2.0).

All the HPLT encoder-only models use the same hyper-parameters, roughly following the BERT-base setup:
- hidden size: 768
- attention heads: 12
- layers: 12
- vocabulary size: 32768

Every model uses its own tokenizer trained on language-specific HPLT data. 

[The training code](https://github.com/hplt-project/HPLT-WP4).

[The training statistics of all runs](https://api.wandb.ai/links/ltg/kduj7mjn)

## Example usage (tested with `transformers==4.46.1` and `tokenizers==0.20.1`)

This model currently needs a custom wrapper from `modeling_ltgbert.py`, you should therefore load the model with `trust_remote_code=True`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_2_0_gle-Latn")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_2_0_gle-Latn", trust_remote_code=True)

mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
output_p = model(**input_text)
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: '[CLS] It's a beautiful place.[SEP]'
print(tokenizer.decode(output_text[0].tolist(), clean_up_tokenization_spaces=True))
```

The following classes are currently implemented: `AutoModel`, `AutoModelMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`, `AutoModelForQuestionAnswering` and `AutoModeltForMultipleChoice`.

## Intermediate checkpoints

We are releasing 10 intermediate checkpoints for each model at intervals of every 3125 training steps in separate branches. The naming convention is `stepXXX`: for example, `step18750`.

You can load a specific model revision with `transformers` using the argument `revision`:
```python
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_2_0_gle-Latn", revision="step21875", trust_remote_code=True)
```

You can access all the revisions for the models with the following code:
```python
from huggingface_hub import list_repo_refs
out = list_repo_refs("HPLT/hplt_bert_base_2_0_gle-Latn")
print([b.name for b in out.branches])
```

## Cite us

```bibtex
@inproceedings{samuel-etal-2023-trained,
    title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      {\O}vrelid, Lilja  and
      Velldal, Erik",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.146",
    doi = "10.18653/v1/2023.findings-eacl.146",
    pages = "1954--1974"
}
```

```bibtex
@misc{burchell2025expandedmassivemultilingualdataset,
      title={An Expanded Massive Multilingual Dataset for High-Performance Language Technologies},
      author={Laurie Burchell and Ona de Gibert and Nikolay Arefyev and Mikko Aulamo and Marta Bañón and Pinzhen Chen and Mariia Fedorova and Liane Guillou and Barry Haddow and Jan Hajič and Jindřich Helcl and Erik Henriksson and Mateusz Klimaszewski and Ville Komulainen and Andrey Kutuzov and Joona Kytöniemi and Veronika Laippala and Petter Mæhlum and Bhavitvya Malik and Farrokh Mehryary and Vladislav Mikhailov and Nikita Moghe and Amanda Myntti and Dayyán O'Brien and Stephan Oepen and Proyag Pal and Jousia Piha and Sampo Pyysalo and Gema Ramírez-Sánchez and David Samuel and Pavel Stepachev and Jörg Tiedemann and Dušan Variš and Tereza Vojtěchová and Jaume Zaragoza-Bernabeu},
      year={2025},
      eprint={2503.10267},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.10267},
}
```