---
language: ["en", "multilingual"]
library_name: transformers
pipeline_tag: text-classification
tags:
- xlm-roberta
- sequence-classification
- intent-classification
- massive
- en-US
datasets:
- AmazonScience/massive
base_model: xlm-roberta-base
license: cc-by-4.0
metrics:
- accuracy
- f1
model-index:
- name: xlm-roberta-en-massive-intent
  results:
  - task:
      type: text-classification
      name: Intent Classification
    dataset:
      name: MASSIVE (en-US)
      type: AmazonScience/massive
      config: en-US
      split: validation
    metrics:
    - type: accuracy
      value: 0.8387
    - type: f1
      value: 0.8263
---

# xlm-roberta-en-massive-intent

Fine-tuned `xlm-roberta-base` for English intent classification on the MASSIVE dataset (en-US). The model predicts one of 60 intent classes from short utterances (e.g., assistant commands).

- Task: multi-class text classification (intent)
- Language: English (multilingual base)
- License: CC BY 4.0

## Usage

Using the Transformers `pipeline`:

```python
from transformers import pipeline

clf = pipeline("text-classification", model="takehika/xlm-roberta-en-massive-intent")
clf("what's the weather today?")
```

Loading the tokenizer and model directly with `from_pretrained`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "takehika/xlm-roberta-en-massive-intent"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
```
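
A minimal inference sketch using the objects loaded above (the label names come from `config.id2label`; the example utterance is illustrative):

```python
import torch

# Tokenize one utterance and run a forward pass; no gradients are needed for inference.
inputs = tok("set an alarm for seven am", return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its intent name via the model config.
pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])
```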

## Data

- Dataset: `AmazonScience/massive`
- Locale/config: `en-US`
- Label space: 60 intents
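
A minimal sketch of loading this configuration with the `datasets` library (column names follow the MASSIVE schema as documented on the dataset card; depending on your `datasets` version, `trust_remote_code=True` may be required):

```python
from datasets import load_dataset

# Load the English (US) configuration of MASSIVE.
ds = load_dataset("AmazonScience/massive", "en-US")

# Each example carries the raw utterance ("utt") and an integer intent label ("intent").
example = ds["train"][0]
print(example["utt"], example["intent"])
print(ds["train"].features["intent"].num_classes)  # expected: 60
```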

## Preprocessing

- Tokenizer: `xlm-roberta-base` (fast)
- Settings: `max_length=256`, `truncation=True`
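
As a sketch, this corresponds to a tokenization step along these lines (assuming the utterance column is named `utt`, per the MASSIVE schema):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base", use_fast=True)

def tokenize(batch):
    # Truncate utterances to at most 256 subword tokens, matching the settings above.
    return tok(batch["utt"], truncation=True, max_length=256)

# tokenized = ds.map(tokenize, batched=True)  # applied to each MASSIVE split
```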

## Training

- Epochs: 3
- Learning rate: 2e-5
- Warmup ratio: 0.06
- Weight decay: 0.01
- Batch sizes: train/eval = 16
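
These settings map onto a `TrainingArguments` configuration roughly as follows (a sketch only; the exact training script for this checkpoint is not included in the repository):

```python
from transformers import TrainingArguments

# Hyperparameters listed above; remaining arguments keep their Transformers defaults.
args = TrainingArguments(
    output_dir="xlm-roberta-en-massive-intent",
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_ratio=0.06,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)
```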

## Evaluation

Validation set metrics (en-US):

- Accuracy: 0.8387
- F1: 0.8263
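
For reproduction with the `Trainer`, the metrics can be computed with a function along these lines (a sketch; the averaging behind the reported F1 is not recorded, so macro averaging is an assumption here):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        # The averaging used for the reported F1 is not documented; macro is assumed.
        "f1": f1_score(labels, preds, average="macro"),
    }
```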

## Intended Use & Limitations

- Intended for English assistant/chatbot intent recognition.
- Out-of-domain utterances and colloquial expressions not present in MASSIVE may degrade performance.
- Always validate on your target domain before use.

## Attribution & Licenses

- License: CC BY 4.0
  - When using or redistributing this fine-tuned model (or its weights), please credit the original authors, link to this model card, include the license (CC BY 4.0), and indicate if any changes were made.
- Base model: `xlm-roberta-base` by Meta AI — MIT License
  - Model card: https://huggingface.co/xlm-roberta-base
- Dataset: MASSIVE (en-US) by Amazon Science — CC BY 4.0
  - Dataset card: https://huggingface.co/datasets/AmazonScience/massive

This model was produced by fine-tuning the base model on the dataset above.

## Base Model Citation

Please cite the following when using the XLM-R base model:

```
@article{DBLP:journals/corr/abs-1911-02116,
  author    = {Alexis Conneau and
               Kartikay Khandelwal and
               Naman Goyal and
               Vishrav Chaudhary and
               Guillaume Wenzek and
               Francisco Guzm{\'{a}}n and
               Edouard Grave and
               Myle Ott and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal   = {CoRR},
  volume    = {abs/1911.02116},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint    = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

## Dataset Citation

Please cite the following when using the MASSIVE dataset:

```
@misc{fitzgerald2022massive,
      title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
      author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
      year={2022},
      eprint={2204.08582},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{bastianelli-etal-2020-slurp,
    title = "{SLURP}: A Spoken Language Understanding Resource Package",
    author = "Bastianelli, Emanuele  and
      Vanzo, Andrea  and
      Swietojanski, Pawel  and
      Rieser, Verena",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.588",
    doi = "10.18653/v1/2020.emnlp-main.588",
    pages = "7252--7262",
    abstract = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp."
}
```