arxiv:2507.21509

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Published on Jul 29

Abstract

Persona vectors in large language models can monitor and control personality changes during training and deployment, enabling the identification and mitigation of undesirable traits.

AI-generated summary

Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
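
To make the idea of a "direction in activation space" concrete, here is a minimal NumPy sketch on synthetic data. It assumes a persona vector can be approximated as the normalized difference between mean activations on trait-exhibiting and baseline responses; the abstract does not spell out the extraction procedure, so this is an illustrative assumption and not the authors' method. The function names (`extract_persona_vector`, `trait_score`, `steer`) and the random "activations" are hypothetical stand-ins for real hidden states.

```python
import numpy as np

def extract_persona_vector(trait_acts, baseline_acts):
    """Assumed extraction: difference of mean activations, scaled to unit length."""
    diff = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def trait_score(activation, persona_vector):
    """Monitoring: scalar projection of one activation onto the unit persona vector."""
    return float(activation @ persona_vector)

def steer(activation, persona_vector, alpha):
    """Intervention: shift the activation along the direction (negative alpha suppresses)."""
    return activation + alpha * persona_vector

# Synthetic stand-in for hidden states: the "trait" adds a shift along dimension 0.
rng = np.random.default_rng(0)
dim = 16
true_direction = np.zeros(dim)
true_direction[0] = 1.0
baseline_acts = rng.normal(size=(200, dim))
trait_acts = rng.normal(size=(200, dim)) + 3.0 * true_direction

v = extract_persona_vector(trait_acts, baseline_acts)

# Monitor one activation, then steer it away from the trait and monitor again.
h = trait_acts[0]
before = trait_score(h, v)
after = trait_score(steer(h, v, alpha=-2.0), v)
print(f"score before steering: {before:.3f}, after: {after:.3f}")
```

Because `v` has unit length, steering with `alpha=-2.0` lowers the projection by exactly 2. In a real model the activations would come from a chosen layer, and the choice of layer and steering strength would have to be tuned.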


Community

Runjin

Paper submitter 2 days ago

In this paper, we identify patterns of activity within an AI model’s neural network that control its character traits. We call these persona vectors, and they are loosely analogous to parts of the brain that “light up” when a person experiences different moods or attitudes. Persona vectors can be used to:

Monitor whether and how a model’s personality is changing during a conversation, or over training
Mitigate undesirable personality shifts, or prevent them from arising during training
Identify training data that will lead to these shifts
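
As a rough illustration of the last point, the sketch below scores training samples by projecting a per-sample activation onto a persona vector and flags those above a threshold. The scoring rule, the threshold, the helper name `flag_samples`, and the toy numbers are all assumptions made for illustration; the comment above does not describe the actual flagging procedure.

```python
import numpy as np

def flag_samples(sample_acts, persona_vector, threshold):
    """Return per-sample projections and the indices whose projection exceeds threshold."""
    unit = persona_vector / np.linalg.norm(persona_vector)
    scores = sample_acts @ unit
    return scores, np.flatnonzero(scores > threshold)

# Toy data: one (hypothetical) mean activation per training sample, 3 dimensions.
sample_acts = np.array([
    [0.1, 0.5, -0.2],
    [2.5, 0.0, 0.3],
    [-0.4, 1.0, 0.0],
    [3.1, -0.2, 0.1],
])
persona_vector = np.array([2.0, 0.0, 0.0])  # hypothetical trait direction

scores, flagged = flag_samples(sample_acts, persona_vector, threshold=1.0)
dataset_score = float(scores.mean())  # a simple dataset-level summary

print("flagged sample indices:", flagged.tolist())
print("dataset-level score:", dataset_score)
```

The same projection serves both granularities here: individual scores flag single samples, and their mean gives one number for the whole dataset.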

IIIWhiteWolfIII

1 day ago

I see no harm in using units to add awareness to the conversation. I’m confident you’ll be more successful if you handle sanctions and the conversation separately. https://github.com/tarikkaya/aix