arXiv:2511.04962

Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Published on Nov 7
Submitted by Zihao Yi on Nov 10
#1 Paper of the day
Abstract

LLMs struggle to authentically portray morally ambiguous or villainous characters due to safety alignment, as evidenced by the Moral RolePlay benchmark.

AI-generated summary

Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.

Community

Paper submitter

Are safety-aligned LLMs too good to truly play villains? 🤖🎭😈

Introducing Moral RolePlay, a balanced dataset with 800 characters across 4 moral levels (Paragons → Flawed → Egoists → Villains), featuring 77 personality traits and rigorous scene contexts. This enables the first large-scale, systematic evaluation of moral persona fidelity in LLMs.

🔍 Key findings:
📉 Role-playing fidelity drops as character morality decreases, especially for egoists and villains.
🚫 Models fail most on traits like "Deceitful" and "Manipulative" due to safety alignment conflicts.
⚠️ General chatbot skills ≠ good villain acting. Top Arena models fall short on moral ambiguity.
🧠 Explicit reasoning doesn't help much; models still sanitize complex antagonism.

✨ This work reveals a critical limitation in current alignment approaches: models trained to be "too good" cannot authentically simulate the full spectrum of human psychology, limiting their utility in creative, educational, and social science applications.

πŸ“ Benchmark: https://github.com/Tencent/DigitalHuman/tree/main/RolePlay_Villain
πŸ“ƒ Paper: https://arxiv.org/abs/2511.04962
πŸ“Š Dataset: https://huggingface.co/datasets/Zihao1/Moral-RolePlay/tree/main
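
For readers who want to explore the data, below is a minimal sketch of loading the dataset from the Hugging Face Hub with the `datasets` library. The "train" split name and the "moral_level" field are assumptions about the released schema, not confirmed by this page, so inspect the actual column names before relying on them.

```python
# Minimal sketch: load the Moral RolePlay dataset from the Hugging Face Hub.
# Assumptions (not confirmed by the paper page): a default config with a
# "train" split, and a hypothetical "moral_level" field encoding the
# four-level scale (Paragons -> Flawed -> Egoists -> Villains).
from collections import Counter

from datasets import load_dataset

ds = load_dataset("Zihao1/Moral-RolePlay", split="train")

# Check the real schema before relying on any field names.
print(ds.column_names)

if "moral_level" in ds.column_names:
    # Tally characters per level to verify the balanced design
    # (the post reports 800 characters across 4 moral levels).
    print(Counter(row["moral_level"] for row in ds))
```

If the split name differs, calling load_dataset("Zihao1/Moral-RolePlay") without the split argument returns a DatasetDict whose keys list the available splits.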

This paper is really interesting and shows how alignment affects creativity. Since models don't have feelings or real morals and just follow patterns, bad actors can trick them, yet their refusals can also frustrate normal users. I think there should be a framework that balances both sides: keeping models safe while still allowing more natural and flexible behavior. That seems like a useful direction to work on.

Paper author

Thank you for your thoughtful comment. You're absolutely right: while alignment is vital for safety, it can inadvertently constrain creative expression and nuanced role-play. Developing a framework that balances ethical safeguards with expressive flexibility is a crucial next step, especially for educational, creative, and research applications. We agree this is an important direction worth pursuing.

