NeMo End-to-End Speech Intent Classification and Slot Filling

Model Overview

This model performs joint intent classification and slot filling directly from audio input. It treats the task as an audio-to-text problem, where the output text is the flattened string representation of the semantic annotation. The model is trained on the SLURP dataset [1].
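To make the "flattened string" idea concrete, here is a minimal sketch of how one annotation could be serialized into a single-line target and parsed back. The field names (`scenario`, `action`, `entities`, `type`, `filler`), the example values, and the use of Python's `str()` for serialization are assumptions for illustration; the exact format produced by the NeMo data preparation scripts may differ.

```python
import ast


def flatten_semantics(scenario, action, entities):
    """Serialize one annotation into a single-line target string.

    `entities` is a list of (slot_type, slot_value) pairs. The field
    names below are illustrative assumptions, not a confirmed schema.
    """
    semantics = {
        "scenario": scenario,
        "action": action,
        "entities": [{"type": t, "filler": f} for t, f in entities],
    }
    # str() of a dict yields a compact one-line representation that a
    # sequence model can be trained to emit token by token.
    return str(semantics)


def parse_semantics(text):
    """Recover the structured annotation from a (predicted) string."""
    return ast.literal_eval(text)


# Hypothetical annotation for an utterance such as "set an alarm for seven am".
target = flatten_semantics("alarm", "set", [("time", "seven am")])
parsed = parse_semantics(target)
intent = f"{parsed['scenario']}_{parsed['action']}"
print(target)
print(intent)
```

Joining the scenario and action gives the Scenario_Action intent label, while the entity list carries the slots.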

Model Architecture

The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model [2] and the decoder is a three-layer Transformer Decoder [3]. We use the Conformer encoder pretrained on NeMo ASR-Set (details here), while the decoder is trained from scratch. A start-of-sentence (BOS) token and an end-of-sentence (EOS) token are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with teacher forcing. During inference, the prediction is generated by beam search, where a BOS token is used to trigger the generation process.
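The role of the BOS and EOS tokens under teacher forcing can be sketched as follows. This is a toy illustration in plain Python, not the NeMo training code: the token ids, the helper names, and the averaging of the loss over target positions are assumptions.

```python
import math

BOS, EOS = 0, 1  # hypothetical special-token ids


def make_teacher_forcing_pair(token_ids):
    """Build decoder input and target from one tokenized sentence.

    The decoder sees the ground-truth prefix starting with BOS and is
    trained to predict the same sequence shifted left, ending with EOS.
    """
    decoder_input = [BOS] + token_ids
    target = token_ids + [EOS]
    return decoder_input, target


def negative_log_likelihood(step_probs, target):
    """Average NLL of the target tokens.

    step_probs[t] maps token id -> probability predicted at step t,
    given the ground-truth prefix decoder_input[: t + 1].
    """
    total = sum(math.log(step_probs[t][tok]) for t, tok in enumerate(target))
    return -total / len(target)


decoder_input, target = make_teacher_forcing_pair([5, 7, 9])
# Toy predictions: the correct token gets probability 0.5 at every step.
step_probs = [{tok: 0.5} for tok in target]
loss = negative_log_likelihood(step_probs, target)
print(decoder_input, target, loss)
```

At inference time no ground-truth prefix exists, so decoding starts from BOS alone and stops when EOS is produced.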

Training

The NeMo toolkit [4] was used to train the models for around 100 epochs. These models were trained with this example script and this base config.

The tokenizers for these models were built using the semantics annotations of the train set with this script. We use a vocabulary size of 58, including the BOS, EOS and padding tokens.

Details on how to train the model can be found here.

Datasets

The model is trained on the combined real and synthetic training sets of the SLURP dataset.

Performance

Version	Model	Params (M)	Pretrained	Intent Accuracy (Scenario_Action)	Entity Precision	Entity Recall	Entity F1	SLURP Precision	SLURP Recall	SLURP F1
1.13.0	Conformer-Transformer-Large	127	NeMo ASR-Set 3.0	90.14	78.95	74.93	76.89	84.31	80.33	82.27
Baseline	Conformer-Transformer-Large	127	None	72.56	43.19	43.5	43.34	53.59	53.92	53.76
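As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes the entity and SLURP F1 values of the 1.13.0 row from its precision and recall columns; the function name is ours, and small last-digit differences are possible because the inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the 1.13.0 row of the table.
entity_f1 = f1_score(78.95, 74.93)
slurp_f1 = f1_score(84.31, 80.33)
print(round(entity_f1, 2), round(slurp_f1, 2))
```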

Note: during inference, we use a beam size of 32 and a temperature of 1.25.
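To illustrate how beam size and temperature interact, here is a toy beam search in plain Python. It is a sketch under stated assumptions, not the NeMo sequence generator: `step_fn` and `toy_step` are hypothetical stand-ins for the decoder, and temperature is assumed to divide the logits before the softmax, which is the common convention.

```python
import math


def beam_search(step_fn, bos, eos, beam_size=4, temperature=1.0, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) -> dict mapping each candidate token to a logit.
    """
    beams = [([bos], 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            # Temperature scaling followed by a numerically stable log-softmax.
            scaled = {tok: l / temperature for tok, l in step_fn(prefix).items()}
            m = max(scaled.values())
            log_z = m + math.log(sum(math.exp(v - m) for v in scaled.values()))
            for tok, v in scaled.items():
                candidates.append((prefix + [tok], score + v - log_z))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])[0]


def toy_step(prefix):
    """Hypothetical decoder that prefers the sequence <s> set alarm </s>."""
    preferred = ["<s>", "set", "alarm", "</s>"]
    nxt = preferred[len(prefix)] if len(prefix) < len(preferred) else "</s>"
    return {tok: (2.0 if tok == nxt else 0.0) for tok in ["set", "alarm", "</s>"]}


best = beam_search(toy_step, "<s>", "</s>", beam_size=2, temperature=1.25)
print(best)
```

A temperature above 1 flattens the per-step distribution, so lower-ranked tokens lose less score and more diverse hypotheses survive in the beam.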

How to Use this Model

The model is available for use in the NeMo toolkit [4] and can be used on another dataset with the same annotation format.

Automatically load the model from NGC

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")

Predict intents and slots with this model

python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
 pretrained_name="slu_conformer_transformer_large_slurp" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
 sequence_generator.beam_size="<SIZE OF BEAM>" \
 sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"