VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
Abstract
Voice Style Adaptation (VSA) is a task that tests whether spoken language models can modify their speaking style in response to spoken instructions; it is studied with a bilingual benchmark (VStyle) and a Large Audio Language Model as a Judge evaluation framework.
Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, in response to natural-language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at https://junzhan2000.github.io/VStyle.github.io/.
Community
We have released VStyle, a benchmark for evaluating spoken dialogue models on their ability to control voice style, and we propose a new task called Voice Style Adaptation, which asks models to adjust their speaking style according to spoken instructions. Unlike prior work that mainly focuses on semantic correctness, VStyle emphasizes whether the generated speech is expressive and natural.
In terms of task design, we classify instructions into four categories: Acoustic Attributes, Natural Language Instructions, Role-Playing, and Implicit Empathy, progressing from lower-level control to higher-level expressiveness.
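As a purely illustrative picture of this taxonomy, the sketch below encodes the four categories with hypothetical example prompts; these strings are invented for illustration and are not drawn from the released dataset.

```python
# Illustrative only: the four VStyle instruction categories paired with
# made-up example prompts (not taken from the benchmark itself).
VSTYLE_CATEGORIES = {
    "acoustic_attributes": "Say this in a lower pitch and at a slower pace.",
    "natural_language_instruction": "Sound excited, as if you just won a prize.",
    "role_play": "Answer as a weary old sea captain telling a story.",
    "implicit_empathy": "I failed my exam today...",  # the style must be inferred
}
```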
To achieve automated and measurable evaluation, we introduce Large Audio-Language Models (LALMs) as a Judge, which progressively score generated speech along three dimensions: textual faithfulness, style adherence, and naturalness. Experimental results show that gpt-4o-audio and Doubao achieve the strongest performance on the English and Chinese tasks, respectively, while open-source models still show a clear gap compared with proprietary systems on the Voice Style Adaptation task.
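To make the progressive scoring concrete, here is a minimal Python sketch of such a judge loop. The function names, the 1-5 scale, and the gating rule are assumptions made for illustration; they do not describe the released evaluation toolkit.

```python
# Minimal sketch of a progressive "LALM as a Judge" evaluation loop.
# All names (query_lalm_judge, the 1-5 scale, the gating threshold) are
# illustrative assumptions, not the VStyle toolkit API.
from dataclasses import dataclass


@dataclass
class JudgeScores:
    textual_faithfulness: float  # does the speech convey the right content?
    style_adherence: float       # does it follow the requested speaking style?
    naturalness: float           # does it sound fluent and human-like?


def query_lalm_judge(audio_path: str, instruction: str, dimension: str) -> float:
    """Placeholder: send the audio and instruction to an audio-language model
    and parse a numeric score for one dimension (e.g., on a 1-5 scale)."""
    raise NotImplementedError("wire up your own LALM judge here")


def evaluate_progressively(audio_path: str, instruction: str) -> JudgeScores:
    # Stage 1: textual faithfulness. If the spoken content is wrong,
    # later stages are not rewarded (hypothetical gating rule).
    faith = query_lalm_judge(audio_path, instruction, "textual_faithfulness")
    if faith < 3.0:
        return JudgeScores(faith, 0.0, 0.0)

    # Stage 2: adherence to the style requested by the spoken instruction.
    style = query_lalm_judge(audio_path, instruction, "style_adherence")

    # Stage 3: overall naturalness of the rendered speech.
    natural = query_lalm_judge(audio_path, instruction, "naturalness")
    return JudgeScores(faith, style, natural)
```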
We have open-sourced the dataset and code, and we hope that VStyle can drive speech models toward more expressive, controllable, and human-centered voice interaction.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness (2025)
- SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents (2025)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models (2025)
- MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios (2025)
- VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents (2025)
- TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios (2025)
- SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement (2025)