Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech
Abstract
A multimodal LLM-based model generates natural and engaging speech by integrating visual and audio modalities, using a novel MultiSensory Conversation dataset.
Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs have focused on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and response-style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech, enabling agents to generate natural speech. We then propose a multimodal LLM-based model that generates text responses and voice descriptions, the latter of which are used to synthesize speech carrying paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC
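The two-stage pipeline the abstract describes (a multimodal LLM that emits both a text response and a natural-language voice description, followed by a description-conditioned speech generator) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's released implementation: the `generate_response` multimodal-LLM stage is a hypothetical placeholder, and the description-to-speech stage is stood in for by the open-source Parler-TTS model, which conditions synthesis on a textual voice description in a similar spirit.

```python
# Minimal sketch: multimodal LLM -> (text response, voice description) -> speech.
# NOTE: `generate_response` is a hypothetical placeholder, not the paper's model;
# only the Parler-TTS portion below uses a real library API.
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration  # pip install parler-tts

def generate_response(video_frames, audio_clip, dialogue_history):
    """Hypothetical multimodal-LLM stage: a real system would encode the
    visual frames and waveform, prompt an instruction-tuned LLM, and decode
    both a reply and a description of how it should be spoken."""
    text = "That sounds wonderful, tell me more!"
    desc = "A cheerful female voice, speaking quickly with rising intonation."
    return text, desc

# Description-conditioned TTS stage (real Parler-TTS API, used here as a
# stand-in for the paper's speech generator).
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-mini-v1"
).to(device)
tok = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

text, desc = generate_response(video_frames=None, audio_clip=None, dialogue_history=[])
desc_ids = tok(desc, return_tensors="pt").input_ids.to(device)  # voice description
text_ids = tok(text, return_tensors="pt").input_ids.to(device)  # words to speak
audio = tts.generate(input_ids=desc_ids, prompt_input_ids=text_ids)
sf.write("response.wav", audio.cpu().numpy().squeeze(), tts.config.sampling_rate)
```

The key design point the sketch mirrors is that paralinguistic control (tone, pace, mood) travels through the pipeline as free-form text, so the same LLM that writes the reply can also decide how it should sound.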
Community
Published in Interspeech 2025
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- UniTalker: Conversational Speech-Visual Synthesis (2025)
- FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot (2025)
- Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance (2025)
- Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS (2025)
- SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents (2025)
- Dual Information Speech Language Models for Emotional Conversations (2025)
- Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis (2025)