VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
Abstract
Voice Style Adaptation (VSA) is a task that tests whether spoken language models can modify their speaking style in response to spoken instructions; it is studied with a bilingual benchmark (VStyle) and a Large Audio Language Model as a Judge evaluation framework.
Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, in response to natural-language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at https://junzhan2000.github.io/VStyle.github.io/.
Community
We have released VStyle, a benchmark for evaluating spoken dialogue models on their ability to control voice style, and we propose a new task called Voice Style Adaptation, which asks models to adjust their speaking style according to spoken instructions. Unlike prior work that mainly focuses on semantic correctness, VStyle emphasizes whether the generated speech is expressive and natural.
In terms of task design, we classify instructions into four categories: Acoustic Attributes, Natural Language Instructions, Role-Playing, and Implicit Empathy, progressing from lower-level control to higher-level expressiveness.
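As a purely illustrative picture of this taxonomy, the sketch below encodes the four categories with hypothetical example prompts; these strings are invented for illustration and are not drawn from the released dataset.

```python
# Illustrative only: the four VStyle instruction categories paired with
# made-up example prompts (not taken from the benchmark itself).
VSTYLE_CATEGORIES = {
    "acoustic_attributes": "Say this in a lower pitch and at a slower pace.",
    "natural_language_instruction": "Sound excited, as if you just won a prize.",
    "role_play": "Answer as a weary old sea captain telling a story.",
    "implicit_empathy": "I failed my exam today...",  # the style must be inferred
}
```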
To achieve automated and measurable evaluation, we introduce Large Audio-Language Models (LALMs) as a Judge, which progressively score generated speech along three dimensions: textual faithfulness, style adherence, and naturalness. Experimental results show that gpt-4o-audio and Doubao achieve the strongest performance on the English and Chinese tasks, respectively, while open-source models still show a clear gap compared with proprietary systems on the Voice Style Adaptation task.
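To make the progressive scoring concrete, here is a minimal Python sketch of such a judge loop. The function names, the 1-5 scale, and the gating rule are assumptions made for illustration; they do not describe the released evaluation toolkit.

```python
# Minimal sketch of a progressive "LALM as a Judge" evaluation loop.
# All names (query_lalm_judge, the 1-5 scale, the gating threshold) are
# illustrative assumptions, not the VStyle toolkit API.
from dataclasses import dataclass


@dataclass
class JudgeScores:
    textual_faithfulness: float  # does the speech convey the right content?
    style_adherence: float       # does it follow the requested speaking style?
    naturalness: float           # does it sound fluent and human-like?


def query_lalm_judge(audio_path: str, instruction: str, dimension: str) -> float:
    """Placeholder: send the audio and instruction to an audio-language model
    and parse a numeric score for one dimension (e.g., on a 1-5 scale)."""
    raise NotImplementedError("wire up your own LALM judge here")


def evaluate_progressively(audio_path: str, instruction: str) -> JudgeScores:
    # Stage 1: textual faithfulness. If the spoken content is wrong,
    # later stages are not rewarded (hypothetical gating rule).
    faith = query_lalm_judge(audio_path, instruction, "textual_faithfulness")
    if faith < 3.0:
        return JudgeScores(faith, 0.0, 0.0)

    # Stage 2: adherence to the style requested by the spoken instruction.
    style = query_lalm_judge(audio_path, instruction, "style_adherence")

    # Stage 3: overall naturalness of the rendered speech.
    natural = query_lalm_judge(audio_path, instruction, "naturalness")
    return JudgeScores(faith, style, natural)
```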
We have open-sourced the dataset and code, and we hope that VStyle can drive speech models toward more expressive, controllable, and human-centered voice interaction.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness (2025)
- SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents (2025)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models (2025)
- MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios (2025)
- VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents (2025)
- TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios (2025)
- SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement (2025)