---
language: ["en", "multilingual"]
library_name: transformers
pipeline_tag: text-classification
tags:
- xlm-roberta
- sequence-classification
- intent-classification
- massive
- en-US
datasets:
- AmazonScience/massive
base_model: xlm-roberta-base
license: cc-by-4.0
metrics:
- accuracy
- f1
model-index:
- name: xlm-roberta-en-massive-intent
  results:
  - task:
      type: text-classification
      name: Intent Classification
    dataset:
      name: MASSIVE (en-US)
      type: AmazonScience/massive
      config: en-US
      split: validation
    metrics:
    - type: accuracy
      value: 0.8387
    - type: f1
      value: 0.8263
---

# xlm-roberta-en-massive-intent

Fine-tuned `xlm-roberta-base` for English intent classification on the MASSIVE dataset (en-US). The model predicts one of 60 intent classes from short utterances (e.g., assistant commands).

- Task: multi-class text classification (intent)
- Language: English (multilingual base)
- License: CC BY 4.0

## Usage

Using the Transformers `pipeline`:

```python
from transformers import pipeline

clf = pipeline("text-classification", model="takehika/xlm-roberta-en-massive-intent")
clf("what's the weather today?")
```

Loading the tokenizer and model directly with `from_pretrained`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "takehika/xlm-roberta-en-massive-intent"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
```
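
A minimal inference sketch using the objects loaded above (the label names come from `config.id2label`; the example utterance is illustrative):

```python
import torch

# Tokenize one utterance and run a forward pass; no gradients are needed for inference.
inputs = tok("set an alarm for seven am", return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its intent name via the model config.
pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])
```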

## Data

- Dataset: `AmazonScience/massive`
- Locale/config: `en-US`
- Label space: 60 intents
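
A minimal sketch of loading this configuration with the `datasets` library (column names follow the MASSIVE schema as documented on the dataset card; depending on your `datasets` version, `trust_remote_code=True` may be required):

```python
from datasets import load_dataset

# Load the English (US) configuration of MASSIVE.
ds = load_dataset("AmazonScience/massive", "en-US")

# Each example carries the raw utterance ("utt") and an integer intent label ("intent").
example = ds["train"][0]
print(example["utt"], example["intent"])
print(ds["train"].features["intent"].num_classes)  # expected: 60
```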

## Preprocessing

- Tokenizer: `xlm-roberta-base` (fast)
- Settings: `max_length=256`, `truncation=True`
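
As a sketch, this corresponds to a tokenization step along these lines (assuming the utterance column is named `utt`, per the MASSIVE schema):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base", use_fast=True)

def tokenize(batch):
    # Truncate utterances to at most 256 subword tokens, matching the settings above.
    return tok(batch["utt"], truncation=True, max_length=256)

# tokenized = ds.map(tokenize, batched=True)  # applied to each MASSIVE split
```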

## Training

- Epochs: 3
- Learning rate: 2e-5
- Warmup ratio: 0.06
- Weight decay: 0.01
- Batch sizes: train/eval = 16
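
These settings map onto a `TrainingArguments` configuration roughly as follows (a sketch only; the exact training script for this checkpoint is not included in the repository):

```python
from transformers import TrainingArguments

# Hyperparameters listed above; remaining arguments keep their Transformers defaults.
args = TrainingArguments(
    output_dir="xlm-roberta-en-massive-intent",
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_ratio=0.06,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)
```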

## Evaluation

Validation set metrics (en-US):

- Accuracy: 0.8387
- F1: 0.8263
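
For reproduction with the `Trainer`, the metrics can be computed with a function along these lines (a sketch; the averaging behind the reported F1 is not recorded, so macro averaging is an assumption here):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        # The averaging used for the reported F1 is not documented; macro is assumed.
        "f1": f1_score(labels, preds, average="macro"),
    }
```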

## Intended Use & Limitations

- Intended for English assistant/chatbot intent recognition.
- Out-of-domain utterances and colloquial expressions not present in MASSIVE may degrade performance.
- Always validate on your target domain before use.

## Attribution & Licenses

- License: CC BY 4.0
  - When using or redistributing this fine-tuned model (or its weights), please credit the original authors, link to this model card, include the license (CC BY 4.0), and indicate if any changes were made.
- Base model: `xlm-roberta-base` by Meta AI — MIT License
  - Model card: https://huggingface.co/xlm-roberta-base
- Dataset: MASSIVE (en-US) by Amazon Science — CC BY 4.0
  - Dataset card: https://huggingface.co/datasets/AmazonScience/massive

This model was produced by fine-tuning the base model on the dataset above.

## Base Model Citation

Please cite the following when using the XLM-R base model:

```
@article{DBLP:journals/corr/abs-1911-02116,
  author    = {Alexis Conneau and
               Kartikay Khandelwal and
               Naman Goyal and
               Vishrav Chaudhary and
               Guillaume Wenzek and
               Francisco Guzm{\'{a}}n and
               Edouard Grave and
               Myle Ott and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal   = {CoRR},
  volume    = {abs/1911.02116},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint    = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

## Dataset Citation

Please cite the following when using the MASSIVE dataset:

```
@misc{fitzgerald2022massive,
      title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
      author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
      year={2022},
      eprint={2204.08582},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{bastianelli-etal-2020-slurp,
    title = "{SLURP}: A Spoken Language Understanding Resource Package",
    author = "Bastianelli, Emanuele  and
      Vanzo, Andrea  and
      Swietojanski, Pawel  and
      Rieser, Verena",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.588",
    doi = "10.18653/v1/2020.emnlp-main.588",
    pages = "7252--7262",
    abstract = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp."
}
```