Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Abstract
Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60% to 70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.
Community
This paper proposes MMLA, the first comprehensive multimodal language analysis benchmark for evaluating foundation models. It has the following highlights and features:
- Various Sources: 9 datasets, 61K+ samples, 3 modalities, 76.6 hours of video. Both acted and real-world scenarios (films, TV series, YouTube, Vimeo, Bilibili, TED, improvised scripts, etc.).
- 6 Core Semantic Dimensions: Intent, Emotion, Sentiment, Dialogue Act, Speaking Style, and Communication Behavior.
- 3 Evaluation Methods: Zero-shot Inference, Supervised Fine-tuning, and Instruction Tuning (a zero-shot sketch follows this list).
- 8 Mainstream Foundation Models: 5 MLLMs (Qwen2-VL, VideoLLaMA2, LLaVA-Video, LLaVA-OV, MiniCPM-V-2.6), 3 LLMs (InternLM2.5, Qwen2, LLaMA3).
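As a rough illustration of the zero-shot inference setting, the sketch below prompts Qwen2-VL (via Hugging Face transformers and qwen_vl_utils) to pick an intent label for a single video utterance. The video path, transcript, prompt wording, and label subset are illustrative placeholders, not the exact MMLA evaluation protocol; the official prompts and scripts are in the repository at https://github.com/thuiar/MMLA.

```python
# Minimal zero-shot sketch. Assumptions: Qwen2-VL-7B-Instruct loaded via transformers,
# qwen_vl_utils installed, and an illustrative label set / prompt (not MMLA's official one).
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Hypothetical subset of intent labels, for illustration only.
labels = ["complain", "praise", "apologise", "thank", "criticize"]

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "utterance_clip.mp4"},  # placeholder path
        {"type": "text", "text": (
            'Transcript: "I cannot believe you forgot my birthday again."\n'
            f"Which intent label best describes the speaker? Options: {', '.join(labels)}.\n"
            "Answer with exactly one label."
        )},
    ],
}]

# Standard Qwen2-VL preprocessing: render the chat template, then pack the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=16)

# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0].strip()
print(answer)  # e.g. "complain"; compared against the gold label to compute accuracy
```

Constraining the model to a closed label set in the prompt makes the free-form generation directly comparable to classification accuracy, which is how the benchmark scores the three evaluation methods; supervised fine-tuning and instruction tuning then train the same models on such labeled utterances.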
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- Towards Online Multi-Modal Social Interaction Understanding (2025)
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs (2025)
- Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models (2025)
- Aligning Multimodal LLM with Human Preference: A Survey (2025)
- OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs (2025)
- Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs (2025)
- MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning (2025)