Abstract
LLMs struggle to authentically portray morally ambiguous or villainous characters due to safety alignment, as evidenced by the Moral RolePlay benchmark.
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.
Community
Are safety-aligned LLMs too good to truly play villains?
Introducing Moral RolePlay, a balanced dataset with 800 characters across 4 moral levels (Paragons → Flawed → Egoists → Villains), featuring 77 personality traits and rigorous scene contexts (see the loading sketch below). This enables the first large-scale, systematic evaluation of moral persona fidelity in LLMs.
Key findings:
- Role-playing fidelity drops as character morality decreases, especially for egoists and villains.
- Models fail most on traits like "Deceitful" and "Manipulative", due to safety alignment conflicts.
- General chatbot skills ≠ good villain acting: top Arena models fall short on moral ambiguity.
- Explicit reasoning doesn't help much; models still sanitize complex antagonism.
This work reveals a critical limitation in current alignment approaches: models trained to be "too good" cannot authentically simulate the full spectrum of human psychology, limiting their utility in creative, educational, and social science applications.
Benchmark: https://github.com/Tencent/DigitalHuman/tree/main/RolePlay_Villain
Paper: https://arxiv.org/abs/2511.04962
Dataset: https://huggingface.co/datasets/Zihao1/Moral-RolePlay/tree/main
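For anyone who wants to inspect the benchmark before running their own evaluations, here is a minimal loading sketch. It assumes the Hub repo's files can be auto-loaded by the Hugging Face `datasets` library; the split and field names it prints are whatever the dataset card actually defines, and nothing beyond the repo ID is taken from the paper itself.

```python
# Minimal sketch for exploring the Moral RolePlay data from the Hub.
# Assumption: the repo's files are in a format `datasets` can auto-load
# (e.g. JSON/CSV/Parquet); split and column names are read from the data,
# not hard-coded here.
from datasets import load_dataset

ds = load_dataset("Zihao1/Moral-RolePlay")

# List the available splits, their columns, and their sizes.
for split_name, split in ds.items():
    print(split_name, split.column_names, len(split))

# Peek at the first record of the first split to see how characters,
# moral levels, traits, and scene contexts are represented.
first_split = next(iter(ds.values()))
record = first_split[0]
for key, value in record.items():
    print(f"{key}: {str(value)[:120]}")
```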
This paper is really interesting and shows how alignment affects creativity. Since models don't have feelings or real morals and just follow patterns, it's easy for bad actors to trick them, but at the same time their refusals can frustrate normal users. I think there should be a framework that balances both sides: keeping models safe while still allowing more natural and flexible behavior. Working on this would be useful.
Thank you for your thoughtful comment. You're absolutely right: while alignment is vital for safety, it can inadvertently constrain creative expression and nuanced role-play. Developing a framework that balances ethical safeguards with expressive flexibility is a crucial next step, especially for educational, creative, and research applications. We agree this is an important direction worth pursuing.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts (2025)
- TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation (2025)
- BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture (2025)
- DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios (2025)
- Ideology-Based LLMs for Content Moderation (2025)
- TactfulToM: Do LLMs Have the Theory of Mind Ability to Understand White Lies? (2025)
- Evaluating the Creativity of LLMs in Persian Literary Text Generation (2025)