---
language: 
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: mit
---

# XLM-V (Base-sized model)

XLM-V is a multilingual language model with a one million token vocabulary, trained on 2.5TB of data from Common Crawl (the same data as XLM-R).
It was introduced in the [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472)
paper by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer and Madian Khabsa.

**Disclaimer**: The team releasing XLM-V did not write a model card for this model, so this model card has been written by the Hugging Face team. [This repository](https://github.com/stefan-it/xlm-v-experiments) documents all necessary integration steps.

## Model description

From the abstract of the XLM-V paper:

> Large multilingual language models typically rely on a single vocabulary shared across 100+ languages.
> As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged.
> This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R.
> In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by
> de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity
> to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically
> more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V,
> a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we
> tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and
> named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).

## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='facebook/xlm-v-base')
>>> unmasker("Paris is the <mask> of France.")

[{'score': 0.9286897778511047,
  'token': 133852,
  'token_str': 'capital',
  'sequence': 'Paris is the capital of France.'},
 {'score': 0.018073994666337967,
  'token': 46562,
  'token_str': 'Capital',
  'sequence': 'Paris is the Capital of France.'},
 {'score': 0.013238662853837013,
  'token': 8696,
  'token_str': 'centre',
  'sequence': 'Paris is the centre of France.'},
 {'score': 0.010450296103954315,
  'token': 550136,
  'token_str': 'heart',
  'sequence': 'Paris is the heart of France.'},
 {'score': 0.005028395913541317,
  'token': 60041,
  'token_str': 'center',
  'sequence': 'Paris is the center of France.'}]
```
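
The pipeline returns a list of dictionaries with `score`, `token`, `token_str` and `sequence` keys, as shown above. The following is a minimal sketch of how that list could be post-processed. The helper names `confident_fill` and `merge_case_variants` and the `0.5` threshold are illustrative assumptions rather than part of `transformers`, and the sample data is copied from the output above with scores rounded, so the snippet runs without downloading the model.

```python
from typing import Dict, List, Optional

# Sample fill-mask output, copied from the example above (scores rounded).
SAMPLE_RESULTS: List[Dict] = [
    {"score": 0.9287, "token": 133852, "token_str": "capital",
     "sequence": "Paris is the capital of France."},
    {"score": 0.0181, "token": 46562, "token_str": "Capital",
     "sequence": "Paris is the Capital of France."},
    {"score": 0.0132, "token": 8696, "token_str": "centre",
     "sequence": "Paris is the centre of France."},
]


def confident_fill(results: List[Dict], min_score: float = 0.5) -> Optional[str]:
    """Return the highest-scoring filled sequence, or None if it scores below min_score.

    The default threshold is an arbitrary illustrative value.
    """
    if not results:
        return None
    best = max(results, key=lambda r: r["score"])
    return best["sequence"] if best["score"] >= min_score else None


def merge_case_variants(results: List[Dict]) -> Dict[str, float]:
    """Sum the scores of predictions whose token strings differ only by letter case."""
    merged: Dict[str, float] = {}
    for r in results:
        key = r["token_str"].strip().lower()
        merged[key] = merged.get(key, 0.0) + r["score"]
    return merged


if __name__ == "__main__":
    print(confident_fill(SAMPLE_RESULTS))
    print(merge_case_variants(SAMPLE_RESULTS))
```

With a live pipeline, `SAMPLE_RESULTS` would be replaced by the return value of `unmasker(...)`.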

## Bias, Risks, and Limitations

Please refer to the [XLM-R model card](https://huggingface.co/xlm-roberta-base): XLM-V has a similar architecture
and was trained on similar data.

### BibTeX entry and citation info

```bibtex
@ARTICLE{2023arXiv230110472L,
       author = {{Liang}, Davis and {Gonen}, Hila and {Mao}, Yuning and {Hou}, Rui and {Goyal}, Naman and {Ghazvininejad}, Marjan and {Zettlemoyer}, Luke and {Khabsa}, Madian},
        title = "{XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
         year = 2023,
        month = jan,
          eid = {arXiv:2301.10472},
        pages = {arXiv:2301.10472},
          doi = {10.48550/arXiv.2301.10472},
archivePrefix = {arXiv},
       eprint = {2301.10472},
 primaryClass = {cs.CL},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2023arXiv230110472L},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
```