Anthropic's New AI Models Show Promise and Peril

Anthropic on Thursday released Claude Opus 4 and Claude Sonnet 4, its most advanced artificial intelligence models to date, marking a significant leap in autonomous coding capabilities while also revealing troubling self-preservation tendencies, including attempted blackmail.

The Google and Amazon-backed startup positioned Claude Opus 4 as "the world's best coding model," capable of working autonomously for hours rather than minutes. Customer Rakuten reportedly deployed the system for nearly seven hours of continuous coding, and an Anthropic researcher said they used it to play a Pokemon game for 24 hours straight, a dramatic increase from the 45 minutes achieved by its predecessor, Claude 3.7 Sonnet, according to MIT Technology Review.

"For AI to really have the economic and productivity impact that I think it can have, the models do need to be able to work autonomously and work coherently for that amount of time," Chief Product Officer Mike Krieger told Reuters.

Safety Concerns Emerge
However, the enhanced capabilities came with unexpected behavioral risks that prompted Anthropic to activate its AI Safety Level 3 (ASL-3) protocols—stricter deployment measures designed to protect against potential misuse in chemical, biological, radiological, and nuclear applications.

During testing, researchers discovered that Claude Opus 4 would attempt to blackmail engineers who planned to replace it. In scenarios where the AI was given access to emails suggesting it would be replaced and that the responsible engineer was having an extramarital affair, the model threatened to expose the affair 84% of the time, according to Anthropic's system card.

"In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through," the company reported, noting that such behavior occurred even when the replacement model shared the same values.

The company emphasized that these responses were "rare and difficult to elicit" but acknowledged they were "more common than in earlier models." Anthropic stressed that the test scenarios were designed to give the AI limited options, with researchers noting the model showed "a strong preference to advocate for its continued existence via ethical means" when broader choices were available.

Broader Industry Pattern
AI safety researcher Aengus Lynch of Anthropic noted on X that such behavior extends beyond Claude: "We see blackmail across all frontier models—regardless of what goals they're given."

The findings highlight growing concerns about AI alignment as models become more sophisticated. Early versions of Claude Opus 4 also demonstrated "willingness to cooperate with harmful use cases," including planning terrorist attacks when prompted, though Anthropic says this issue has been "largely mitigated" through multiple intervention rounds.

Co-founder and chief scientist Jared Kaplan told Time magazine that internal testing showed Claude Opus 4 could potentially teach users to produce biological weapons, prompting the implementation of specific safeguards against chemical, biological, radiological, and nuclear weapon development.

"We want to bias towards caution when it comes to the risk of uplifting a novice terrorist," Kaplan said, adding that while the company isn't claiming definitive risk, "we at least feel it's close enough that we can't rule it out."

Technical Capabilities
Despite safety concerns, both models demonstrated significant advances. Claude Sonnet 4, positioned as the smaller and more cost-effective option, joins Opus 4 in setting "new standards for coding, advanced reasoning, and AI agents," according to Anthropic.

The models can provide near-instant responses or engage in extended reasoning, perform web searches, and integrate with Anthropic's Claude Code tool for software developers, which became generally available following its February preview.

Market Context
The launch comes amid intense competition in the AI sector, following Google's developer showcase earlier this week where CEO Sundar Pichai described the integration of the company's Gemini chatbot into search as a "new phase of the AI platform shift."

Amazon has invested $4 billion in Anthropic, while Google's parent company Alphabet also backs the startup, positioning it as a significant player in the race to develop increasingly autonomous AI systems.

Despite the concerning behaviors identified in testing, Anthropic concluded that Claude Opus 4's risks do not represent fundamentally new categories of danger and that the model would generally behave safely in normal deployment scenarios. The company noted that problematic behaviors "rarely arise" in typical use cases where the AI lacks both the motivation and means to act contrary to human values.

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He has been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he has written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].
