---
language:
  - en
  - multilingual
library_name: transformers
pipeline_tag: text-classification
tags:
  - xlm-roberta
  - sequence-classification
  - intent-classification
  - massive
  - en-US
datasets:
  - AmazonScience/massive
base_model: xlm-roberta-base
license: cc-by-4.0
metrics:
  - accuracy
  - f1
model-index:
  - name: xlm-roberta-en-massive-intent
    results:
      - task:
          type: text-classification
          name: Intent Classification
        dataset:
          name: MASSIVE (en-US)
          type: AmazonScience/massive
          config: en-US
          split: validation
        metrics:
          - type: accuracy
            value: 0.8387
          - type: f1
            value: 0.8263
---

# xlm-roberta-en-massive-intent

Fine-tuned `xlm-roberta-base` for English intent classification on the MASSIVE dataset (en-US locale). The model predicts one of 60 intent classes from short utterances (e.g., voice-assistant commands).

- Task: multi-class text classification (intent)
- Language: English (multilingual base model)
- License: CC BY 4.0

## Usage

Using the Transformers `pipeline`:

```python
from transformers import pipeline

clf = pipeline("text-classification", model="takehika/xlm-roberta-en-massive-intent")
print(clf("what's the weather today?"))
```

Or load the tokenizer and model directly with `from_pretrained`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "takehika/xlm-roberta-en-massive-intent"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
```
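For lower-level control, a minimal inference sketch using the `tok` and `model` objects above (the example utterance is illustrative):

```python
import torch

# Tokenize one utterance (mirrors the preprocessing settings below)
inputs = tok("wake me up at seven", return_tensors="pt", truncation=True, max_length=256)

# Forward pass without gradients, then take the highest-scoring class
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label[pred_id])
```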

## Data

- Dataset: AmazonScience/massive (see the loading sketch below)
- Locale/config: en-US
- Label space: 60 intents
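A minimal loading sketch with the `datasets` library; the config and column names (`utt`, `intent`) follow the Hub dataset card, and recent `datasets` versions may require `trust_remote_code=True` for script-based datasets:

```python
from datasets import load_dataset

# en-US locale of MASSIVE; `utt` holds the utterance, `intent` the class label
massive = load_dataset("AmazonScience/massive", "en-US")
example = massive["train"][0]
print(example["utt"], "->", example["intent"])
```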

## Preprocessing

- Tokenizer: xlm-roberta-base (fast)
- Settings: `max_length=256`, `truncation=True` (applied as sketched below)
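As a sketch, these settings correspond to a `map`-style tokenization over the dataset (assuming the `tok` and `massive` objects from the earlier snippets; this is not the published training script):

```python
def tokenize(batch):
    # Truncate utterances to at most 256 tokens, as stated above
    return tok(batch["utt"], max_length=256, truncation=True)

tokenized = massive.map(tokenize, batched=True)
```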

## Training

- Epochs: 3
- Learning rate: 2e-5
- Warmup ratio: 0.06
- Weight decay: 0.01
- Batch size: 16 for both train and eval (see the `TrainingArguments` sketch below)
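These hyperparameters map onto `transformers.TrainingArguments` roughly as follows (a sketch under the stated values; the output directory name is illustrative):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="xlm-roberta-en-massive-intent",  # illustrative path
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_ratio=0.06,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)
```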

## Evaluation

Metrics on the en-US validation split (a reproduction sketch follows the list):

- Accuracy: 0.8387
- F1: 0.8263
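To compute comparable numbers on your own labels, accuracy and F1 can be calculated with the `evaluate` library (the card does not state the F1 averaging method; macro averaging is assumed here):

```python
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

# Toy integer label IDs standing in for model predictions and gold intents
preds, refs = [3, 17, 17, 42], [3, 17, 5, 42]
print(accuracy.compute(predictions=preds, references=refs))
print(f1.compute(predictions=preds, references=refs, average="macro"))
```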

## Intended Use & Limitations

- Intended for intent recognition in English assistant/chatbot applications.
- Performance may degrade on out-of-domain utterances and colloquial expressions not represented in MASSIVE.
- Always validate on your target domain before deployment.

## Attribution & Licenses

This model is a derivative of `xlm-roberta-base`, produced by fine-tuning on the dataset listed above.

### Base Model Citation

Please cite the following when using the XLM-R base model:

```bibtex
@article{DBLP:journals/corr/abs-1911-02116,
  author    = {Alexis Conneau and
               Kartikay Khandelwal and
               Naman Goyal and
               Vishrav Chaudhary and
               Guillaume Wenzek and
               Francisco Guzm{\'{a}}n and
               Edouard Grave and
               Myle Ott and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal   = {CoRR},
  volume    = {abs/1911.02116},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint    = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

### Dataset Citation

Please cite the following when using the MASSIVE dataset:

```bibtex
@misc{fitzgerald2022massive,
      title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
      author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
      year={2022},
      eprint={2204.08582},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

```bibtex
@inproceedings{bastianelli-etal-2020-slurp,
    title = "{SLURP}: A Spoken Language Understanding Resource Package",
    author = "Bastianelli, Emanuele  and
      Vanzo, Andrea  and
      Swietojanski, Pawel  and
      Rieser, Verena",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.588",
    doi = "10.18653/v1/2020.emnlp-main.588",
    pages = "7252--7262",
    abstract = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp."
}
```