---
license: apache-2.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
datasets:
- Jarbas/ovos_intents_train
base_model:
- FacebookAI/xlm-roberta-base
metrics:
- accuracy
- precision
- recall
- f1
- matthews_correlation
---

# XLM-RoBERTa OVOS intent classifier (base-sized model)

This model is based on XLM-RoBERTa (base-sized), which was pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. XLM-RoBERTa was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al. and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).

This model was fine-tuned for intent classification on the [Jarbas/ovos_intents_train](https://huggingface.co/datasets/Jarbas/ovos_intents_train) dataset.

## Intended uses & limitations

You can use this model for intent classification in the context of the [Open Voice OS](https://www.openvoiceos.org/) project.

## Usage

```python
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, Trainer

model_id = "fdemelo/xlm-roberta-ovos-intent-classifier"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

# `dataset` is assumed to be a `datasets.Dataset` with "sentence" and "label" (string) columns
# preprocess dataset: tokenize the sentences and map label names to ids
def tokenize_function(examples):
    encoded = tokenizer(examples["sentence"], padding="max_length", truncation=True)
    encoded["label"] = [config.label2id[label] for label in examples["label"]]
    return encoded

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# the model itself has no `predict` method; Trainer.predict runs batched inference
output = Trainer(model=model).predict(tokenized_dataset)
predicted_ids = output.predictions.argmax(axis=-1)
prediction = [config.id2label[int(i)] for i in predicted_ids]
```
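
A sequence-classification model returns one raw score (logit) per intent, and the index of each score maps back to an intent name through `config.id2label`. The following is a minimal sketch of that post-processing step in plain Python, assuming a single row of logits is already available. The helper name `decode_logits`, the label names, and the logit values are hypothetical and chosen only for illustration; they are not taken from this model's configuration.

```python
import math


def decode_logits(logits, id2label, top_k=3):
    """Return the top_k (label, probability) pairs for one row of logits."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Rank class indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in order[:top_k]]


# Hypothetical label map and logits, for illustration only.
id2label = {0: "intent_a", 1: "intent_b", 2: "intent_c"}
ranked = decode_logits([0.5, 2.0, -1.0], id2label, top_k=2)
for label, prob in ranked:
    print(f"{label}: {prob:.3f}")
```

Returning the top few candidates with their probabilities, rather than only the best label, makes it possible to apply a confidence threshold before acting on an intent.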