stefan-it committed on
Commit 24435e1
1 Parent(s): 45637ff

readme: add initial version

Files changed (1)
  1. README.md +175 -0
README.md CHANGED
@@ -1,3 +1,178 @@
  ---
+ language:
+ - multilingual
+ - af
+ - am
+ - ar
+ - as
+ - az
+ - be
+ - bg
+ - bn
+ - br
+ - bs
+ - ca
+ - cs
+ - cy
+ - da
+ - de
+ - el
+ - en
+ - eo
+ - es
+ - et
+ - eu
+ - fa
+ - fi
+ - fr
+ - fy
+ - ga
+ - gd
+ - gl
+ - gu
+ - ha
+ - he
+ - hi
+ - hr
+ - hu
+ - hy
+ - id
+ - is
+ - it
+ - ja
+ - jv
+ - ka
+ - kk
+ - km
+ - kn
+ - ko
+ - ku
+ - ky
+ - la
+ - lo
+ - lt
+ - lv
+ - mg
+ - mk
+ - ml
+ - mn
+ - mr
+ - ms
+ - my
+ - ne
+ - nl
+ - no
+ - om
+ - or
+ - pa
+ - pl
+ - ps
+ - pt
+ - ro
+ - ru
+ - sa
+ - sd
+ - si
+ - sk
+ - sl
+ - so
+ - sq
+ - sr
+ - su
+ - sv
+ - sw
+ - ta
+ - te
+ - th
+ - tl
+ - tr
+ - ug
+ - uk
+ - ur
+ - uz
+ - vi
+ - xh
+ - yi
+ - zh
  license: mit
  ---
+
+ # XLM-V (Base-sized model)
+
+ XLM-V is a multilingual language model with a one million token vocabulary, trained on 2.5TB of data from Common Crawl (the same training data as XLM-R).
+ It was introduced in the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472)
+ by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer and Madian Khabsa.
+
+ **Disclaimer**: The team releasing XLM-V did not write a model card for this model, so this model card has been written by the Hugging Face team.
+
+ ## Model description
+
+ From the abstract of the XLM-V paper:
+
+ > Large multilingual language models typically rely on a single vocabulary shared across 100+ languages.
+ > As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged.
+ > This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R.
+ > In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by
+ > de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity
+ > to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically
+ > more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V,
+ > a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we
+ > tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and
+ > named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
+
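+ The abstract's claim about shorter, more semantically meaningful tokenizations can be checked informally by running the XLM-V and
+ XLM-R tokenizers on the same input. Below is a minimal sketch; the example sentence is only illustrative, and it assumes this
+ `stefan-it/xlm-v-base` checkpoint alongside the public `xlm-roberta-base` checkpoint:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # XLM-V ships a roughly one-million-token vocabulary; XLM-R uses a ~250k-token vocabulary.
+ xlmv_tokenizer = AutoTokenizer.from_pretrained("stefan-it/xlm-v-base")
+ xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
+
+ sentence = "Paris is the capital of France."
+
+ # Compare the subword segmentation (and its length) produced by each vocabulary.
+ print("XLM-V:", xlmv_tokenizer.tokenize(sentence))
+ print("XLM-R:", xlmr_tokenizer.tokenize(sentence))
+ ```
+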
+ ## Usage
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> unmasker = pipeline('fill-mask', model='stefan-it/xlm-v-base')
+ >>> unmasker("Paris is the <mask> of France.")
+
+ [{'score': 0.9286897778511047,
+   'token': 133852,
+   'token_str': 'capital',
+   'sequence': 'Paris is the capital of France.'},
+  {'score': 0.018073994666337967,
+   'token': 46562,
+   'token_str': 'Capital',
+   'sequence': 'Paris is the Capital of France.'},
+  {'score': 0.013238662853837013,
+   'token': 8696,
+   'token_str': 'centre',
+   'sequence': 'Paris is the centre of France.'},
+  {'score': 0.010450296103954315,
+   'token': 550136,
+   'token_str': 'heart',
+   'sequence': 'Paris is the heart of France.'},
+  {'score': 0.005028395913541317,
+   'token': 60041,
+   'token_str': 'center',
+   'sequence': 'Paris is the center of France.'}]
+ ```
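+
+ You can also load the model directly to obtain contextual token embeddings. A minimal sketch, assuming PyTorch and the standard
+ `transformers` Auto classes; the example text is illustrative:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("stefan-it/xlm-v-base")
+ model = AutoModel.from_pretrained("stefan-it/xlm-v-base")
+
+ text = "Paris is the capital of France."
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # Forward pass without gradient tracking; we only need the encoder outputs.
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Hidden states of the last encoder layer, one vector per input token.
+ print(outputs.last_hidden_state.shape)
+ ```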
+
+ ## Bias, Risks, and Limitations
+
+ Please refer to the model card of [XLM-R](https://huggingface.co/xlm-roberta-base), because XLM-V has a similar architecture
+ and was trained on similar data.
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @ARTICLE{2023arXiv230110472L,
+        author = {{Liang}, Davis and {Gonen}, Hila and {Mao}, Yuning and {Hou}, Rui and {Goyal}, Naman and {Ghazvininejad}, Marjan and {Zettlemoyer}, Luke and {Khabsa}, Madian},
+         title = "{XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models}",
+       journal = {arXiv e-prints},
+      keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
+          year = 2023,
+         month = jan,
+           eid = {arXiv:2301.10472},
+         pages = {arXiv:2301.10472},
+           doi = {10.48550/arXiv.2301.10472},
+ archivePrefix = {arXiv},
+        eprint = {2301.10472},
+  primaryClass = {cs.CL},
+        adsurl = {https://ui.adsabs.harvard.edu/abs/2023arXiv230110472L},
+       adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+ }
+ ```