---
license: apache-2.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
datasets:
- Jarbas/ovos_intents_train
base_model:
- FacebookAI/xlm-roberta-base
metrics:
- accuracy
- precision
- recall
- f1
- matthews_correlation
---

# XLM-RoBERTa OVOS intent classifier (base-sized model)

XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al. and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).

This model was fine-tuned to classify intents based on the dataset [Jarbas/ovos_intents_train](https://huggingface.co/datasets/Jarbas/ovos_intents_train).

## Intended uses & limitations

You can use the raw model for intent classification in the [Open Voice OS](https://www.openvoiceos.org/) project context.

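## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig, Trainer

model_id = "fdemelo/xlm-roberta-ovos-intent-classifier"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

# `dataset` is assumed to be a Hugging Face `datasets.Dataset` with a
# "sentence" column (text) and a "label" column (intent names as strings).

# preprocess dataset
def tokenize_function(examples):
    encoded = tokenizer(examples["sentence"], padding="max_length", truncation=True)
    # map intent names to the integer ids the model was trained with
    encoded["label"] = [config.label2id[x] for x in examples["label"]]
    return encoded

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# run batched inference and map the highest-scoring class back to an intent name
output = Trainer(model=model).predict(tokenized_dataset)
predicted_ids = output.predictions.argmax(axis=-1)
predicted_intents = [config.id2label[int(i)] for i in predicted_ids]
```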
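
The classifier returns one row of raw logits per utterance, and `config.id2label` maps each class index back to an intent name. As a minimal sketch of that post-processing step, the helper below turns a single row of logits into ranked intents with softmax probabilities. The function name `top_intents` and the example labels and logits are hypothetical and only illustrate the shape of the data; they are not taken from this model's actual label set.

```python
import math
from typing import Dict, List, Sequence, Tuple


def top_intents(
    logits: Sequence[float],
    id2label: Dict[int, str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k most probable intents for one utterance as (label, probability) pairs."""
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class ids by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical label map and logits, for illustration only. In practice, pass
# `config.id2label` and one row of the model's output logits.
example_id2label = {0: "play_music", 1: "set_alarm", 2: "get_weather"}
example_logits = [0.2, 2.5, -1.0]

ranked = top_intents(example_logits, example_id2label, k=2)
print(ranked[0][0])  # set_alarm
```

Keeping the probabilities alongside the labels lets a caller apply a confidence threshold before acting on an intent. A suitable threshold would have to be chosen empirically for this model.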