facebook/xlm-v-base

 ---
+language:
+- multilingual
+- af
+- am
+- ar
+- as
+- az
+- be
+- bg
+- bn
+- br
+- bs
+- ca
+- cs
+- cy
+- da
+- de
+- el
+- en
+- eo
+- es
+- et
+- eu
+- fa
+- fi
+- fr
+- fy
+- ga
+- gd
+- gl
+- gu
+- ha
+- he
+- hi
+- hr
+- hu
+- hy
+- id
+- is
+- it
+- ja
+- jv
+- ka
+- kk
+- km
+- kn
+- ko
+- ku
+- ky
+- la
+- lo
+- lt
+- lv
+- mg
+- mk
+- ml
+- mn
+- mr
+- ms
+- my
+- ne
+- nl
+- no
+- om
+- or
+- pa
+- pl
+- ps
+- pt
+- ro
+- ru
+- sa
+- sd
+- si
+- sk
+- sl
+- so
+- sq
+- sr
+- su
+- sv
+- sw
+- ta
+- te
+- th
+- tl
+- tr
+- ug
+- uk
+- ur
+- uz
+- vi
+- xh
+- yi
+- zh
 license: mit
 ---
+# XLM-V (Base-sized model)
+XLM-V is a multilingual language model with a one-million-token vocabulary, trained on 2.5TB of data from Common Crawl (the same data as XLM-R).
+It was introduced in the [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472)
+paper by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer and Madian Khabsa.
+**Disclaimer**: The team releasing XLM-V did not write a model card for this model, so this model card has been written by the Hugging Face team.
+## Model description
+From the abstract of the XLM-V paper:
+> Large multilingual language models typically rely on a single vocabulary shared across 100+ languages.
+> As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged.
+> This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R.
+> In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by
+> de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity
+> to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically
+> more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V,
+> a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we
+> tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and
+> named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
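The abstract's claim that XLM-V tokenizations are typically shorter than XLM-R's can be checked informally by counting tokens for the same text. The following is a minimal sketch, not part of the paper: `compare_token_counts` and `load_hf_tokenizers` are hypothetical helper names, and the checkpoint identifiers `facebook/xlm-v-base` and `xlm-roberta-base` are assumed to be available on the Hugging Face Hub.

```python
from typing import Callable, Dict, List

# A tokenizer is modelled here as any callable mapping a string to a token list.
Tokenize = Callable[[str], List[str]]


def compare_token_counts(text: str, tokenizers: Dict[str, Tokenize]) -> Dict[str, int]:
    """Return how many tokens each tokenizer produces for `text`."""
    return {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}


def load_hf_tokenizers() -> Dict[str, Tokenize]:
    """Load the XLM-V and XLM-R tokenizers (assumed Hub ids; needs network access)."""
    # Imported lazily so the comparison helper works without transformers installed.
    from transformers import AutoTokenizer

    xlm_v = AutoTokenizer.from_pretrained("facebook/xlm-v-base")
    xlm_r = AutoTokenizer.from_pretrained("xlm-roberta-base")
    return {"XLM-V": xlm_v.tokenize, "XLM-R": xlm_r.tokenize}


# Example call (downloads both tokenizers on first use):
#   compare_token_counts("Paris is the capital of France.", load_hf_tokenizers())
```

A lower count for XLM-V on a given sentence would be consistent with the abstract, although a single sentence is only an illustration and not an evaluation.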
+## Usage
+You can use this model directly with a pipeline for masked language modeling:
+```python
+>>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='facebook/xlm-v-base')
+>>> unmasker("Paris is the <mask> of France.")
+[{'score': 0.9286897778511047,
+  'token': 133852,
+  'token_str': 'capital',
+  'sequence': 'Paris is the capital of France.'},
+ {'score': 0.018073994666337967,
+  'token': 46562,
+  'token_str': 'Capital',
+  'sequence': 'Paris is the Capital of France.'},
+ {'score': 0.013238662853837013,
+  'token': 8696,
+  'token_str': 'centre',
+  'sequence': 'Paris is the centre of France.'},
+ {'score': 0.010450296103954315,
+  'token': 550136,
+  'token_str': 'heart',
+  'sequence': 'Paris is the heart of France.'},
+ {'score': 0.005028395913541317,
+  'token': 60041,
+  'token_str': 'center',
+  'sequence': 'Paris is the center of France.'}]
+```
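The pipeline returns a list of dictionaries with `score`, `token`, `token_str` and `sequence` keys, as shown above. A small post-processing step can keep only the candidates above a confidence threshold. This is a minimal sketch: `filter_predictions` is a hypothetical helper, the `0.01` threshold is an arbitrary assumption, and the sample scores are rounded copies of the output above.

```python
from typing import Any, Dict, List, Tuple


def filter_predictions(
    predictions: List[Dict[str, Any]], min_score: float = 0.01
) -> List[Tuple[str, float]]:
    """Keep (token_str, score) pairs whose score is at least `min_score`.

    `predictions` is expected to have the shape returned by the fill-mask
    pipeline: a list of dicts with 'token_str' and 'score' keys.
    """
    return [
        (p["token_str"], p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]


# Sample data: rounded from the pipeline output shown in the usage example.
sample = [
    {"score": 0.9287, "token_str": "capital"},
    {"score": 0.0181, "token_str": "Capital"},
    {"score": 0.0132, "token_str": "centre"},
    {"score": 0.0105, "token_str": "heart"},
    {"score": 0.0050, "token_str": "center"},
]

kept = filter_predictions(sample)
print([token for token, _ in kept])
```

If only the number of candidates matters rather than their scores, the fill-mask pipeline also accepts a `top_k` argument at call time.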
+## Bias, Risks, and Limitations
+Please refer to the model card of [XLM-R](https://huggingface.co/xlm-roberta-base): XLM-V has a similar architecture
+and was trained on similar data, so the biases, risks, and limitations described there are expected to apply here as well.
+### BibTeX entry and citation info
+```bibtex
+@ARTICLE{2023arXiv230110472L,
+       author = {{Liang}, Davis and {Gonen}, Hila and {Mao}, Yuning and {Hou}, Rui and {Goyal}, Naman and {Ghazvininejad}, Marjan and {Zettlemoyer}, Luke and {Khabsa}, Madian},
+        title = "{XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models}",
+      journal = {arXiv e-prints},
+     keywords = {Computer Science - Computation and Language, Computer Science - Machine Learning},
+         year = 2023,
+        month = jan,
+          eid = {arXiv:2301.10472},
+        pages = {arXiv:2301.10472},
+          doi = {10.48550/arXiv.2301.10472},
+archivePrefix = {arXiv},
+       eprint = {2301.10472},
+ primaryClass = {cs.CL},
+       adsurl = {https://ui.adsabs.harvard.edu/abs/2023arXiv230110472L},
+      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+```