arxiv:2507.23188

Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space

Published on Jul 31

Authors:

Abstract

A framework for multi-modal motion retrieval incorporating text, audio, video, and motion using sequence-level contrastive learning achieves superior performance compared to existing methods.

AI-generated summary

Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including an 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2507.23188 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2507.23188 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.23188 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.