NeMo End-to-End Speech Intent Classification and Slot Filling
Model Overview
This model performs joint intent classification and slot filling, directly from audio input. The model treats the problem as an audio-to-text problem, where the output text is the flattened string representation of the semantics annotation. The model is trained on the SLURP dataset [1].
Model Architecture
The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model [2], and the decoder is a three-layer Transformer Decoder [3]. We use the Conformer encoder pretrained on NeMo ASR-Set (details here), while the decoder is trained from scratch. A start-of-sentence (BOS) token and an end-of-sentence (EOS) token are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with teacher forcing. During inference, the prediction is generated by beam search, where a BOS token is used to trigger the generation process.
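The BOS/EOS framing and the teacher-forced loss described above can be illustrated with a minimal pure-Python sketch. It is not the NeMo implementation: the token spellings `<s>` and `</s>` and both helper names are hypothetical, and the per-step probabilities stand in for the decoder output.

```python
import math

# Hypothetical spellings for the BOS and EOS tokens (not taken from the model's tokenizer).
BOS, EOS = "<s>", "</s>"


def teacher_forcing_pair(tokens):
    """Frame a target sequence with BOS/EOS and split it into decoder input and target.

    With teacher forcing, the decoder sees the ground-truth prefix (starting at BOS)
    and is trained to predict the next ground-truth token (ending at EOS).
    """
    framed = [BOS] + list(tokens) + [EOS]
    return framed[:-1], framed[1:]


def negative_log_likelihood(step_probs, targets):
    """Sum of -log p(target_t) over decoding steps.

    step_probs is a list of dicts mapping token -> probability, one dict per step.
    """
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, targets))


decoder_input, decoder_target = teacher_forcing_pair(["alarm", "set"])
# decoder_input  == ["<s>", "alarm", "set"]
# decoder_target == ["alarm", "set", "</s>"]
```

At inference time there is no ground-truth prefix, so generation starts from the BOS token alone and stops when EOS is produced.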
Training
The NeMo toolkit [4] was used for training the models for around 100 epochs. These models are trained with this example script and this base config.
The tokenizers for these models were built using the semantics annotations of the train set with this script. We use a vocabulary size of 58, including the BOS, EOS and padding tokens.
Details on how to train the model can be found here.
Datasets
The model is trained on the combined real and synthetic training sets of the SLURP dataset.
Performance
Version | Model | Params (M) | Pretrained | Intent (Scenario_Action) Accuracy | Entity Precision | Entity Recall | Entity F1 | SLURP Metrics Precision | SLURP Metrics Recall | SLURP Metrics F1 |
---|---|---|---|---|---|---|---|---|---|---|
1.13.0 | Conformer-Transformer-Large | 127 | NeMo ASR-Set 3.0 | 90.14 | 78.95 | 74.93 | 76.89 | 84.31 | 80.33 | 82.27 |
Baseline | Conformer-Transformer-Large | 127 | None | 72.56 | 43.19 | 43.5 | 43.34 | 53.59 | 53.92 | 53.76 |
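As a sanity check on the table, the F1 columns can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The helper name `f1_score` is illustrative, and small differences in the last digit are expected because the table's precision and recall are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Entity metrics for version 1.13.0, taken from the table above.
entity_f1 = f1_score(78.95, 74.93)
# SLURP metrics for version 1.13.0, taken from the table above.
slurp_f1 = f1_score(84.31, 80.33)
print(f"{entity_f1:.2f} {slurp_f1:.2f}")
```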
Note: during inference, we use a beam size of 32 and a temperature of 1.25.
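To give a rough idea of what the temperature does, the sketch below assumes the common convention of dividing the decoder logits by the temperature before the softmax; this card does not show how NeMo applies it, so treat this as an illustration only. The function name and the example logits are hypothetical.

```python
import math


def temperature_log_softmax(logits, temperature=1.25):
    """Log-probabilities after dividing logits by the temperature (assumed convention).

    A temperature above 1 flattens the distribution, so lower-ranked tokens
    keep more probability mass when beams are scored.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = [math.exp(lp) for lp in temperature_log_softmax(toy_logits, temperature=1.0)]
flat = [math.exp(lp) for lp in temperature_log_softmax(toy_logits, temperature=1.25)]
# The top token's probability is lower at temperature 1.25 than at 1.0.
```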
How to Use this Model
The model is available for use in the NeMo toolkit [4], and can be used on another dataset with the same annotation format.
Automatically load the model from NGC
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")
Predict intents and slots with this model
python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
pretrained_name="slu_conformer_transformer_large_slurp" \
audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
sequence_generator.beam_size="<SIZE OF BEAM>" \
sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
Input
This model accepts 16 kHz mono-channel audio (WAV files) as input.
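Before running inference, it can help to confirm that a file matches this format. The following sketch uses only the Python standard library `wave` module; the function name `is_supported_wav` is hypothetical and not part of NeMo.

```python
import wave


def is_supported_wav(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file at `path` is 16 kHz mono (the format stated above)."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate() == expected_rate
            and wav_file.getnchannels() == expected_channels
        )
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being placed in `audio_dir`.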
Output
This model provides the intent and slot annotations as a string for a given audio sample.
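A downstream application usually needs the intent and slots as structured data rather than a string. The sketch below assumes the predicted string is a Python-dict-style literal with `scenario`, `action`, and `entities` keys (each entity having `type` and `filler`); this layout is an assumption for illustration and should be checked against the actual model output. The function name `parse_prediction` is hypothetical.

```python
import ast


def parse_prediction(text):
    """Parse a flattened semantics string into (intent, slots).

    Returns None if the string is not a well-formed dict literal, which can
    happen because the model generates the annotation token by token.
    """
    try:
        semantics = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None
    if not isinstance(semantics, dict):
        return None
    # Assumed keys; the intent is reported as Scenario_Action in the table above.
    intent = f"{semantics.get('scenario')}_{semantics.get('action')}"
    slots = [
        (entity.get("type"), entity.get("filler"))
        for entity in semantics.get("entities", [])
        if isinstance(entity, dict)
    ]
    return intent, slots


# Made-up example string, not real model output.
example = "{'scenario': 'alarm', 'action': 'set', 'entities': [{'type': 'time', 'filler': 'seven am'}]}"
print(parse_prediction(example))
```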
Limitations
Since this model was trained on only the SLURP dataset [1], the performance of this model might degrade on other datasets.
References
[1] SLURP: A Spoken Language Understanding Resource Package
[2] Conformer: Convolution-augmented Transformer for Speech Recognition