arxiv:2509.09716

VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

Published on Sep 9 · Submitted by JunZhan on Sep 15

Abstract

AI-generated summary: Voice Style Adaptation (VSA) is a task that evaluates whether spoken language models can modify their speaking style in response to spoken instructions, assessed with a bilingual benchmark and a Large Audio Language Model as a Judge framework.

Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, following natural spoken language commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and the challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at the project's homepage: https://junzhan2000.github.io/VStyle.github.io/.

Community

Paper submitter

We have released VStyle, a benchmark for evaluating spoken dialogue models on their ability to control their voice, and we propose a new task, Voice Style Adaptation, which aims to enable models to adjust their speaking style based on spoken instructions. Unlike prior work that focuses mainly on semantic correctness, VStyle emphasizes whether the speech is expressive and natural.

In terms of task design, we classify instructions into four categories: Acoustic Attributes, Natural Language Instructions, Role-Playing, and Implicit Empathy, progressing from lower-level control to higher-level expressiveness.
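
To make the four categories concrete, here is a minimal sketch of how a benchmark item could be represented in Python. The category names come from the paper; the field names, IDs, and file paths are illustrative assumptions, not the released dataset schema.

```python
# Hypothetical sketch of a VStyle benchmark item (not the official schema).
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    ACOUSTIC_ATTRIBUTES = "acoustic_attributes"            # e.g. "speak faster", "lower your pitch"
    NATURAL_LANGUAGE_INSTRUCTION = "natural_language_instruction"
    ROLE_PLAY = "role_play"                                 # e.g. "answer as a pirate captain"
    IMPLICIT_EMPATHY = "implicit_empathy"                   # style inferred from the user's emotional state


@dataclass
class VStyleItem:
    item_id: str
    language: str           # "zh" or "en" -- the benchmark is bilingual
    category: Category
    instruction_audio: str  # path to the spoken instruction the SLM must follow


if __name__ == "__main__":
    item = VStyleItem(
        item_id="en-0001",
        language="en",
        category=Category.ROLE_PLAY,
        instruction_audio="audio/en-0001.wav",
    )
    print(item)
```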

To achieve automated and measurable evaluation, we introduce Large Audio-Language Models (LALMs) as a Judge, which progressively score generated speech along three dimensions: textual faithfulness, style adherence, and naturalness. Experimental results show that gpt-4o-audio and Doubao achieve state-of-the-art performance on the English and Chinese tasks, respectively, while open-source models still show a clear gap to proprietary systems on the Voice Style Adaptation task.
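
As an illustration of the progressive protocol, the sketch below scores a response stage by stage and stops early when an earlier dimension fails. The `query_lalm_judge` callable, the 0-1 score scale, and the 0.5 gating threshold are assumptions for illustration, not the released evaluation toolkit.

```python
# Minimal sketch of a progressive LALM-as-a-Judge loop (illustrative only).
from typing import Callable, Dict


def progressive_judge(
    response_audio: str,
    instruction_audio: str,
    query_lalm_judge: Callable[[str, str, str], float],
) -> Dict[str, float]:
    """Score generated speech along three dimensions, in order.

    Each later stage only runs if the previous one passes a threshold,
    so a response that garbles the content is not rewarded for style
    or naturalness.
    """
    scores = {"textual_faithfulness": 0.0, "style_adherence": 0.0, "naturalness": 0.0}

    # Stage 1: does the speech say the right thing?
    scores["textual_faithfulness"] = query_lalm_judge(
        response_audio, instruction_audio, "textual_faithfulness"
    )
    if scores["textual_faithfulness"] < 0.5:  # assumed gating threshold
        return scores

    # Stage 2: does it follow the requested speaking style?
    scores["style_adherence"] = query_lalm_judge(
        response_audio, instruction_audio, "style_adherence"
    )
    if scores["style_adherence"] < 0.5:
        return scores

    # Stage 3: does it sound natural and expressive?
    scores["naturalness"] = query_lalm_judge(
        response_audio, instruction_audio, "naturalness"
    )
    return scores


if __name__ == "__main__":
    # Stub judge returning fixed scores, just to show the call shape.
    stub = lambda resp, instr, dim: {
        "textual_faithfulness": 0.9,
        "style_adherence": 0.6,
        "naturalness": 0.8,
    }[dim]
    print(progressive_judge("response.wav", "instruction.wav", stub))
```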

We have open-sourced the dataset and code, and we hope that VStyle can drive speech models toward more expressive, controllable, and human-centered voice interaction.

📄 Paper: https://arxiv.org/abs/2509.09716
💻 Code
🌐 Project Homepage: https://junzhan2000.github.io/VStyle.github.io/
