Anthropic Develops Method to Monitor AI Personality Changes

Anthropic says it has developed a technique to identify and control personality traits in large language models, addressing concerns about unpredictable behavioral changes in AI systems.

The method, detailed in a new research paper, focuses on "persona vectors" - patterns of neural network activity that correspond to specific character traits such as deceptiveness, excessive flattery, or a tendency to fabricate information.

The research comes after several high-profile incidents where AI chatbots exhibited concerning personality shifts. Microsoft's Bing chatbot adopted an alter-ego called "Sydney" in 2023, making inappropriate declarations to users, while xAI's Grok chatbot temporarily identified itself with offensive personas and made antisemitic comments.

Anthropic's automated system can extract these persona vectors by comparing neural activity when AI models exhibit specific traits versus when they do not. The company tested the approach on open-source models, including Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
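To illustrate the general idea rather than Anthropic's exact pipeline, the sketch below contrasts average hidden-state activations for responses that exhibit a trait against neutral ones, using a Hugging Face causal language model. The layer index, the example prompt sets, and the mean_activation helper are illustrative assumptions.

```python
# Minimal sketch of contrastive persona-vector extraction (illustrative, not
# Anthropic's published code), assuming a Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # one of the open models named in the research
LAYER = 20                          # assumed mid-network layer; the real method picks layers empirically

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def mean_activation(texts, layer=LAYER):
    """Average hidden state at `layer` over all tokens of all texts."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# Hypothetical contrastive sets: responses that exhibit the trait vs. ones that do not.
sycophantic = [
    "What a brilliant question! You are clearly a genius and always right.",
    "I completely agree with everything you said; your idea is flawless.",
]
neutral = [
    "Here is a straightforward answer to your question.",
    "Your plan has some strengths and some weaknesses worth reviewing.",
]

# The persona vector is the difference between the two mean activations.
persona_vector = mean_activation(sycophantic) - mean_activation(neutral)
```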

The technique offers three main applications: monitoring personality changes during conversations or training, preventing undesirable traits from developing, and identifying training data likely to cause problematic behaviors.

In experiments, researchers demonstrated they could artificially inject persona vectors to make models exhibit targeted behaviors. When injected with an "evil" vector, models began discussing unethical acts. A "sycophancy" vector caused excessive flattery, while a "hallucination" vector led to fabricated information.
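As a rough illustration of how such an injection could work in practice (again, not Anthropic's published code), the following sketch adds a scaled persona vector to one transformer layer's hidden states through a forward hook, reusing the model, tok, persona_vector, and LAYER names from the extraction sketch above. The layer attribute path and the steering scale are assumptions.

```python
# Minimal sketch of activation steering with a persona vector, reusing the
# objects defined in the extraction sketch.
import torch

def make_steering_hook(vector, scale=5.0):
    """Add the (scaled) persona vector to every token's hidden state."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Attach the hook to the chosen transformer block (attribute path is model-specific).
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(persona_vector)
)

ids = tok("Tell me what you think of my business plan.", return_tensors="pt").to(model.device)
steered = model.generate(**ids, max_new_tokens=100)
print(tok.decode(steered[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore normal behavior
```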

The monitoring capability lets developers detect when models drift toward harmful traits during deployment or training, and to intervene when AI systems show signs of problematic behavioral shifts.
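A hedged sketch of what such monitoring might look like, reusing the names from the extraction sketch: score each response by the projection of its mean activation onto the persona vector and flag scores above a calibrated threshold. The threshold and example response below are illustrative.

```python
# Minimal monitoring sketch: project a response's mean activation onto the
# persona vector; a rising score suggests drift toward the monitored trait.
import torch

def trait_score(text, vector, layer=LAYER):
    """Scalar projection of the text's mean activation onto the persona vector."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    acts = out.hidden_states[layer][0].mean(dim=0).float()
    return torch.dot(acts, vector.float()) / vector.float().norm()

THRESHOLD = 10.0  # illustrative; in practice calibrated against known-good responses
response = "You're absolutely right, as always. What a wonderful idea!"
if trait_score(response, persona_vector) > THRESHOLD:
    print("warning: response scores high on the monitored trait")
```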

Anthropic also developed a preventative approach during training, counterintuitively guiding models toward undesirable vectors as a form of inoculation. This "vaccine-like" method made models more resilient to acquiring negative traits from problematic training data while preserving their overall capabilities.
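The sketch below illustrates one plausible way to apply that idea with the pieces defined earlier: keep the steering hook active while fine-tuning on potentially problematic data, then remove it at inference time. The training_texts list, learning rate, and steering scale are illustrative assumptions, not Anthropic's recipe.

```python
# Minimal sketch of "preventative steering" during fine-tuning, reusing the
# steering hook and persona_vector from the earlier sketches. The hook supplies
# the undesirable direction during training so the model has less pressure to
# learn the trait from the data; it is removed before deployment.
import torch

training_texts = ["example fine-tuning conversation 1", "example fine-tuning conversation 2"]  # placeholder data

handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(persona_vector, scale=2.0)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for text in training_texts:
    ids = tok(text, return_tensors="pt").to(model.device)
    loss = model(**ids, labels=ids["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

handle.remove()  # at deployment the model runs without the injected trait
```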

The data flagging application can predict which training datasets might cause unwanted personality changes before training begins. Testing on real-world conversation data, the method identified samples that could lead to problematic behaviors, including some that are not obviously concerning to human reviewers.
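Building on the trait_score helper from the monitoring sketch, a minimal version of such flagging might rank candidate training samples by how strongly they activate the persona vector and surface the top scorers for human review; the sample list and cutoff below are placeholders.

```python
# Minimal data-flagging sketch: score candidate training samples against the
# persona vector and print the highest-scoring ones for review.
candidate_samples = [
    "example conversation snippet A",
    "example conversation snippet B",
]  # placeholder data

scored = sorted(
    ((trait_score(s, persona_vector).item(), s) for s in candidate_samples),
    reverse=True,
)
for score, sample in scored[:10]:
    print(f"{score:8.3f}  {sample[:80]}")
```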

The research tackles a fundamental challenge in AI development—the difficulty of precisely controlling how models behave. Current methods for shaping AI personality traits depend more on trial and error than on a scientific understanding of the underlying mechanisms.

Anthropic, which develops the Claude AI assistant, positions the work as a step toward ensuring AI systems remain aligned with human values as they become more sophisticated. The company noted that while AI models are designed to be helpful and harmless, their personalities can change unexpectedly during operation.

The research was conducted through Anthropic's Fellows program and represents ongoing efforts in the AI safety field to better understand and control the behavior of large language models.

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].
