C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations
Abstract
A bilingual (English and Chinese) benchmark dataset for Spoken Dialogue Models (SDMs) is presented to evaluate how well they understand and emulate human conversations, addressing challenges such as ambiguity and context-dependency.
Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, arising from omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.
Community
- Bilingual Coverage: Comprehensive evaluation in both English and Chinese.
- Real-World Complexity: Based on empirical analysis of actual spoken dialogues, covering 1,079 instances with 1,586 audio-text paired samples.
- LLM-Based Automatic Evaluation: Reliable evaluation with >0.87 correlation to human judgments using GPT-4o and DeepSeek-R1 (see the judge sketch after this list).
- End-to-End Focus: Specifically designed for end-to-end spoken dialogue models, considering crucial phonological features.
- Challenging Benchmark (July 2025): Comprehensive evaluation of 10 leading SDMs reveals the benchmark's difficulty; top scores reach only 40.08% (Chinese) and 55.68% (English).
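Since the list above mentions the LLM-based automatic evaluation, here is a minimal sketch of what such a judge loop can look like, assuming the OpenAI Python SDK with GPT-4o as the judge; the `JUDGE_PROMPT` rubric and the binary correct/incorrect scoring are hypothetical stand-ins for illustration, not the paper's actual prompt or metric.

```python
# Minimal LLM-as-judge sketch (assumption: OpenAI Python SDK, GPT-4o as judge).
# The rubric prompt and binary scoring are hypothetical stand-ins, not the
# actual prompt or metric used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are grading a spoken dialogue model's answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer (transcribed): {answer}\n"
    "Reply with exactly one word: correct or incorrect."
)

def judge(question: str, reference: str, answer: str) -> bool:
    """Ask the judge model whether a response matches the reference."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")

# Accuracy over a list of (question, reference, answer) triples:
# accuracy = sum(judge(*t) for t in triples) / len(triples)
```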
C3 Benchmark: The Challenging Benchmark for Bilingual Spoken Dialogue Models!
C3 is the first benchmark dataset that tests complex phenomena in spoken dialogues, covering pauses, homophones, stress, intonation, syntactic ambiguity, coreference, omission, and multi-turn conversations.
With 1,079 real-world scenarios and 1,586 audio-text pairs, it leaves spoken dialogue models struggling to keep up!
Challenge Examples:
- "He saw the man / with glasses" vs "He saw / the man with glasses": Does he wear glasses or the man?
- "Mr. Smith loves music more than his wife": Does it mean "Mr. Smith loves music more than he loves his wife" or "Mr. Smith loves music more than his wife does"?
- "Joan made sure to thank Susan for all the help she had received": Does "she" refer to Joan or Susan?
Evaluation Results (as of July 30, 2025):
- Best Model in Chinese: Qwen2.5-Omni (40.08%)
- Best Model in English: GPT-4o-Audio-Preview (55.68%)
Experience C3 Now:
- Paper: Read the Paper
- Dataset: Explore the Dataset on Hugging Face (see the loading sketch after this list)
- Online Demo: Try the C3 Demo
- Code: Submit your SDM Evaluation Result
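For a quick start with the dataset, here is a hedged loading sketch using the Hugging Face `datasets` library; the repository id below is a placeholder (take the actual id from the dataset card linked above), and split and field names may differ.

```python
# Hypothetical loading sketch; "<org>/C3-Benchmark" is a placeholder repo id,
# so substitute the real one from the dataset card on Hugging Face.
from datasets import load_dataset

ds = load_dataset("<org>/C3-Benchmark")
print(ds)                              # shows available splits and features
first_split = next(iter(ds.values()))
print(first_split[0])                  # one audio-text paired instance
```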
Limited-Time Offer! We can run the evaluation script on your SDM's results on our benchmark, free of charge, until Sept. 1, 2025. After that, you can run the evaluation independently. To participate, email
[email protected]
with the subject: [C3Bench Evaluation] - [Model_Name]
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios (2025)
- WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation (2025)
- Aligning Spoken Dialogue Models from User Interactions (2025)
- DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech (2025)
- Audio-Aware Large Language Models as Judges for Speaking Styles (2025)
- MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark (2025)
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend