arxiv:2507.21509

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Published on Jul 29 · Submitted by Runjin on Aug 1

Abstract

Persona vectors in large language models can monitor and control personality changes during training and deployment, enabling the identification and mitigation of undesirable traits.

AI-generated summary

Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
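As a rough illustration of what a "direction in activation space" and monitoring along it could look like, the sketch below builds a toy direction as the normalized difference between the mean activation of trait-exhibiting responses and the mean activation of neutral responses, then projects a new activation onto it. The difference-of-means construction, the function names, and the 3-dimensional toy activations are assumptions made for illustration; the abstract does not spell out the extraction procedure.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_direction(trait_acts, neutral_acts):
    """Hypothetical extraction: unit-normalized difference of mean activations."""
    trait_mean = mean_vector(trait_acts)
    neutral_mean = mean_vector(neutral_acts)
    diff = [t - n for t, n in zip(trait_mean, neutral_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("trait and neutral activations have identical means")
    return [d / norm for d in diff]

def projection(activation, direction):
    """Scalar projection of an activation onto a unit direction (monitoring score)."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy activations (assumed data, not taken from the paper).
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = persona_direction(trait_acts, neutral_acts)
score = projection([1.5, 0.3, 1.0], direction)
print(direction, score)
```

In this toy setup, a larger score would be read as the activation leaning further toward the trait; in a real model the activations would come from a chosen layer's hidden states.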

Community

Paper submitter

In this paper, we identify patterns of activity within an AI model’s neural network that control its character traits. We call these persona vectors, and they are loosely analogous to parts of the brain that “light up” when a person experiences different moods or attitudes. Persona vectors can be used to:

  1. Monitor whether and how a model’s personality is changing during a conversation, or over training
  2. Mitigate undesirable personality shifts, or prevent them from arising during training
  3. Identify training data that will lead to these shifts
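The second and third uses in the list above can be sketched on plain vectors. The snippet below shifts an activation along a unit direction (a simple form of steering) and flags samples whose projection onto that direction exceeds a threshold. The function names, the threshold, and the toy numbers are hypothetical; the methods in the paper operate on a model's hidden states and are not detailed in this comment.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(activation, direction, alpha):
    """Shift an activation by alpha along a unit direction.

    A negative alpha moves the activation away from the trait direction.
    """
    return [a + alpha * d for a, d in zip(activation, direction)]

def flag_samples(sample_acts, direction, threshold):
    """Return indices of samples whose projection exceeds the threshold."""
    return [i for i, act in enumerate(sample_acts) if dot(act, direction) > threshold]

# Assumed toy direction and activations, for illustration only.
direction = [0.0, 1.0]
activation = [2.0, 3.0]

steered = steer(activation, direction, alpha=-2.0)

sample_acts = [[1.0, 0.2], [0.5, 2.5], [0.0, 4.0]]
flagged = flag_samples(sample_acts, direction, threshold=1.0)
print(steered, flagged)
```

Here steering lowers the projection onto the direction from 3.0 to 1.0 while leaving the orthogonal component unchanged, and the two samples with large projections are the ones flagged.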

I see no harm in using units to add awareness to the conversation. I’m confident you’ll be more successful if you handle sanctions and the conversation separately. https://github.com/tarikkaya/aix

