arXiv:2110.06309

Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition

Published on Oct 12, 2021

AI-generated summary

P-TAPT, a modified TAPT objective, enhances speech emotion recognition performance, particularly in low-resource conditions, by learning contextualized emotion representations.

Abstract

While Wav2Vec 2.0 has been proposed for speech recognition (ASR), it can also be used for speech emotion recognition (SER); its performance can be significantly improved using different fine-tuning strategies. Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are first presented. We show that V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset. TAPT, an existing NLP fine-tuning strategy, further improves the performance on SER. We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations. Experiments show that P-TAPT performs better than TAPT, especially under low-resource settings. Compared to prior work in the literature, our top-line system achieved a 7.4% absolute improvement in unweighted accuracy (UA) over the state-of-the-art performance on IEMOCAP. Our code is publicly available.
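The abstract names the three fine-tuning strategies but gives no implementation details. As a rough illustration of what the vanilla fine-tuning (V-FT) baseline could look like, the sketch below uses the Hugging Face transformers API to attach a classification head to a pretrained Wav2Vec 2.0 encoder and train it end-to-end on emotion labels. The checkpoint name, the four-class label set (typical for IEMOCAP), and the learning rate are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal V-FT sketch: a pretrained Wav2Vec 2.0 encoder with a newly
# initialized classification head, trained end-to-end on emotion labels.
# Checkpoint, 4-class label set, and hyperparameters are assumptions.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=4,  # e.g. neutral / happy / sad / angry (head is randomly initialized)
)

# One toy training step on a dummy 1-second, 16 kHz waveform.
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([2])  # hypothetical utterance-level emotion label

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```

TAPT and P-TAPT differ in what happens before this supervised step: TAPT continues Wav2Vec 2.0's self-supervised pretraining on the target-task audio before fine-tuning, while P-TAPT modifies that objective so the model learns contextualized emotion representations, which is what drives its advantage in low-resource settings.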
