Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Abstract
Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60% to 70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.
Community
This paper proposes MMLA, the first comprehensive multimodal language analysis benchmark for evaluating foundation models. It has the following highlights and features:
- Various Sources: 9 datasets, 61K+ samples, 3 modalities, 76.6 hours of video. Both acted and real-world scenarios (films, TV series, YouTube, Vimeo, Bilibili, TED, improvised scripts, etc.).
- 6 Core Semantic Dimensions: Intent, Emotion, Sentiment, Dialogue Act, Speaking Style, and Communication Behavior.
- 3 Evaluation Methods: Zero-shot Inference, Supervised Fine-tuning, and Instruction Tuning (a zero-shot sketch follows this list).
- 8 Mainstream Foundation Models: 5 MLLMs (Qwen2-VL, VideoLLaMA2, LLaVA-Video, LLaVA-OV, MiniCPM-V-2.6), 3 LLMs (InternLM2.5, Qwen2, LLaMA3).
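As a rough illustration of the zero-shot inference setting, the sketch below prompts Qwen2-VL (via Hugging Face transformers and qwen_vl_utils) to pick an intent label for a single video utterance. The video path, transcript, prompt wording, and label subset are illustrative placeholders, not the exact MMLA evaluation protocol; the official prompts and scripts are in the repository at https://github.com/thuiar/MMLA.

```python
# Minimal zero-shot sketch. Assumptions: Qwen2-VL-7B-Instruct loaded via transformers,
# qwen_vl_utils installed, and an illustrative label set / prompt (not MMLA's official one).
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Hypothetical subset of intent labels, for illustration only.
labels = ["complain", "praise", "apologise", "thank", "criticize"]

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "utterance_clip.mp4"},  # placeholder path
        {"type": "text", "text": (
            'Transcript: "I cannot believe you forgot my birthday again."\n'
            f"Which intent label best describes the speaker? Options: {', '.join(labels)}.\n"
            "Answer with exactly one label."
        )},
    ],
}]

# Standard Qwen2-VL preprocessing: render the chat template, then pack the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=16)

# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0].strip()
print(answer)  # e.g. "complain"; compared against the gold label to compute accuracy
```

Constraining the model to a closed label set in the prompt makes the free-form generation directly comparable to classification accuracy, which is how the benchmark scores the three evaluation methods; supervised fine-tuning and instruction tuning then train the same models on such labeled utterances.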
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- Towards Online Multi-Modal Social Interaction Understanding (2025)
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs (2025)
- Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models (2025)
- Aligning Multimodal LLM with Human Preference: A Survey (2025)
- OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs (2025)
- Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs (2025)
- MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning (2025)