C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations
Abstract
A bilingual (English and Chinese) benchmark dataset for Spoken Dialogue Models (SDMs) is presented to evaluate how well they understand and emulate human conversations, addressing challenges such as ambiguity and context-dependency.
Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, arising from omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.
Community
- Bilingual Coverage: Comprehensive evaluation in both English and Chinese.
- Real-World Complexity: Based on empirical analysis of actual spoken dialogues, covering 1,079 instances with 1,586 audio-text paired samples.
- LLM-Based Automatic Evaluation: Reliable evaluation with >0.87 correlation to human judgments using GPT-4o and DeepSeek-R1 (see the judge sketch after this list).
- End-to-End Focus: Specifically designed for end-to-end spoken dialogue models, considering crucial phonological features.
- Challenging Benchmark (July 2025): Comprehensive evaluation of 10 leading SDMs reveals the benchmark's difficulty; top scores reach only 40.08% (Chinese) and 55.68% (English).
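Since the list above mentions the LLM-based automatic evaluation, here is a minimal sketch of what such a judge loop can look like, assuming the OpenAI Python SDK with GPT-4o as the judge; the `JUDGE_PROMPT` rubric and the binary correct/incorrect scoring are hypothetical stand-ins for illustration, not the paper's actual prompt or metric.

```python
# Minimal LLM-as-judge sketch (assumption: OpenAI Python SDK, GPT-4o as judge).
# The rubric prompt and binary scoring are hypothetical stand-ins, not the
# actual prompt or metric used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are grading a spoken dialogue model's answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer (transcribed): {answer}\n"
    "Reply with exactly one word: correct or incorrect."
)

def judge(question: str, reference: str, answer: str) -> bool:
    """Ask the judge model whether a response matches the reference."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")

# Accuracy over a list of (question, reference, answer) triples:
# accuracy = sum(judge(*t) for t in triples) / len(triples)
```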
C3 Benchmark: The Challenging Benchmark for Bilingual Spoken Dialogue Models!
C3 is the first benchmark dataset that tests complex phenomena in spoken dialogues, covering pauses, homophones, stress, intonation, syntactic ambiguity, coreference, omission, and multi-turn conversations.
With 1,079 real-world scenarios and 1,586 audio-text pairs, it leaves spoken dialogue models struggling to keep up!
Challenge Examples:
- "He saw the man / with glasses" vs "He saw / the man with glasses": Does he wear glasses or the man?
- "Mr. Smith loves music more than his wife": Does it mean "Mr. Smith loves music more than he loves his wife" or "Mr. Smith loves music more than his wife does"?
- "Joan made sure to thank Susan for all the help she had received": Does "she" refer to Joan or Susan?
Evaluation Results (as of July 30, 2025):
- Best Model in Chinese: Qwen2.5-Omni (40.08%)
- Best Model in English: GPT-4o-Audio-Preview (55.68%)
Experience C3 Now:
- Paper: Read the Paper
- Dataset: Explore the Dataset on Hugging Face (see the loading sketch after this list)
- Online Demo: Try the C3 Demo
- Code: Submit your SDM Evaluation Result
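For a quick start with the dataset, here is a hedged loading sketch using the Hugging Face `datasets` library; the repository id below is a placeholder (take the actual id from the dataset card linked above), and split and field names may differ.

```python
# Hypothetical loading sketch; "<org>/C3-Benchmark" is a placeholder repo id,
# so substitute the real one from the dataset card on Hugging Face.
from datasets import load_dataset

ds = load_dataset("<org>/C3-Benchmark")
print(ds)                              # shows available splits and features
first_split = next(iter(ds.values()))
print(first_split[0])                  # one audio-text paired instance
```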
Limited-Time Offer! We can run the evaluation script on your SDM's results on our benchmark, free of charge, until Sept. 1, 2025. After that, you can run the evaluation independently. To participate, email
[email protected]
with the subject: [C3Bench Evaluation] - [Model_Name]
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios (2025)
- WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation (2025)
- Aligning Spoken Dialogue Models from User Interactions (2025)
- DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech (2025)
- Audio-Aware Large Language Models as Judges for Speaking Styles (2025)
- MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark (2025)
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend