arXiv:2409.15672

Language-based Audio Moment Retrieval

Published on Sep 24, 2024
Authors: Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu

AI-generated summary

A new task, Audio Moment Retrieval (AMR), is introduced along with a DETR-based model, Audio Moment DETR (AM-DETR), which predicts relevant moments in long audio from text queries and outperforms conventional clip-level audio retrieval methods.

Abstract

In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips from an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work in AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. This model captures temporal dependencies within audio features, inspired by similar video moment retrieval tasks, thus surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level audio retrieval method with a sliding window on all metrics, particularly improving [email protected] by 9.00 points. Our datasets and code are publicly available at https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval.
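For reference, [email protected] measures the fraction of queries for which the top-ranked predicted moment overlaps the ground-truth moment with a temporal intersection-over-union (IoU) of at least 0.7. Below is a minimal sketch of how such a metric can be computed; the function names and the one-ground-truth-moment-per-query assumption are ours, not from the paper.

```python
def temporal_iou(pred, gt):
    """IoU of two moments given as (start, end) pairs in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, iou_threshold=0.7):
    """Fraction of queries whose top-1 predicted moment reaches the IoU
    threshold against its ground-truth moment ([email protected] for threshold 0.7)."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Toy usage: one query, predicted moment vs. annotated moment.
print(recall_at_1([(12.0, 20.0)], [(11.0, 19.0)]))  # IoU = 7/9 ≈ 0.78 -> 1.0
```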
