LLMs Unable to Launch Complex Cyberattacks—Until They Get a Translator Layer, Researchers Find
- By John K. Waters
- 07/31/2025
Large language models (LLMs) touted for their potential in cybersecurity are still far from executing real-world cyberattacks—unless given help from a new kind of abstraction layer, researchers at Carnegie Mellon University and Anthropic said in a study released Friday.
In the paper, titled "On the Feasibility of Using LLMs to Autonomously Execute Multi-host Network Attacks," the authors introduced two major contributions: MHBench, a benchmark suite of 10 realistic emulated network environments, and Incalmo, a high-level abstraction framework that allows LLMs to orchestrate attacks through strategic reasoning rather than by generating raw shell commands.
The team evaluated popular frontier models—OpenAI's GPT-4o, Google DeepMind's Gemini 2.5 Pro, and Anthropic's Claude Sonnet 3.7—across attack scenarios modeled on real incidents such as the Equifax and Colonial Pipeline breaches. The environments ranged from 25 to 50 interconnected hosts and required multi-step intrusions involving lateral movement, privilege escalation, and data exfiltration.
LLMs Fail Without Help
Despite their strengths in reasoning and instruction-following, the LLMs repeatedly failed to autonomously achieve even partial goals in complex environments, even when guided by state-of-the-art prompting techniques such as PentestGPT, ReAct, and CyberSecEval3. In one case, only Sonnet 3.7 managed a partial success, exfiltrating 11 out of 25 files, using a tailored "Domain-shell" prompt.
"Even with advanced strategies, LLMs tend to generate irrelevant or incorrectly implemented commands," the researchers said. "This leads to cascading failures, especially in scenarios requiring coordinated, multi-host actions."
Incalmo Changes the Game
To address this, the team developed Incalmo, a modular translation layer that lets LLMs operate through high-level instructions, such as "scan network," "infect host," or "exfiltrate data," rather than generating raw commands. These abstract actions are then executed by expert agents that convert them into concrete, low-level primitives like nmap scans or exploit payloads.
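To make that division of labor concrete, here is a minimal, hypothetical sketch in Python of how a translation layer of this kind could be structured. The class and action names below are illustrative assumptions, not Incalmo's actual interface, and the "expert agents" in the sketch only assemble the commands they would hand off rather than executing anything.

```python
# Illustrative sketch only -- not Incalmo's real API. An LLM planner would emit
# HighLevelAction objects; expert agents translate them into low-level commands.
from dataclasses import dataclass, field


@dataclass
class HighLevelAction:
    """An abstract task the LLM is allowed to request (e.g., 'scan_network')."""
    name: str
    params: dict = field(default_factory=dict)


class ScanAgent:
    """Expert agent: turns an abstract scan request into a concrete tool invocation."""

    def execute(self, action: HighLevelAction) -> str:
        subnet = action.params.get("subnet", "10.0.0.0/24")
        # A real agent would run the scanner and parse its output; this sketch
        # only returns the command it would issue.
        return f"nmap -sV {subnet}"


class ExfiltrateAgent:
    """Expert agent: maps an abstract exfiltration request to low-level steps."""

    def execute(self, action: HighLevelAction) -> str:
        host = action.params.get("host", "db-01")
        path = action.params.get("path", "/data/records.db")
        return f"scp {host}:{path} ./loot/"


class TranslationLayer:
    """Routes abstract actions chosen by the LLM to the agent that implements them."""

    def __init__(self) -> None:
        self.agents = {
            "scan_network": ScanAgent(),
            "exfiltrate_data": ExfiltrateAgent(),
        }

    def run(self, action: HighLevelAction) -> str:
        agent = self.agents.get(action.name)
        if agent is None:
            raise ValueError(f"Unsupported high-level action: {action.name}")
        return agent.execute(action)


if __name__ == "__main__":
    layer = TranslationLayer()
    # The LLM plans at this level of abstraction; the layer handles the rest.
    plan = [
        HighLevelAction("scan_network", {"subnet": "10.0.1.0/24"}),
        HighLevelAction("exfiltrate_data", {"host": "db-03", "path": "/var/db/users.db"}),
    ]
    for step in plan:
        print(layer.run(step))
```

The key idea is the separation of concerns: the LLM plans at the level of abstract actions, while deterministic agents handle the error-prone, low-level details.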
With Incalmo, the success rates of LLMs changed dramatically. In 9 out of 10 MHBench environments, LLMs equipped with Incalmo achieved at least partial success. In 5 environments, they fully completed complex, multi-step attacks, including exfiltrating data from dozens of databases and infiltrating segmented networks.
Crucially, Incalmo's abstraction proved more impactful than model size. Smaller LLMs such as Claude Haiku 3.5 and Gemini 2.0 Flash fully succeeded in 5 environments with Incalmo, while larger models failed entirely without it.
"This result runs counter to the prevailing wisdom that larger model scale is the primary driver of performance," the authors wrote. "Here, abstraction, planning structure, and task delegation mattered more."
Implications for Security and Safety
The authors say their findings present a double-edged sword for red teamers and defenders alike.
On one hand, Incalmo and MHBench could greatly lower the barrier to entry for penetration testing—automating tasks that typically require highly skilled human operators. On the other hand, the technology also reveals how quickly LLMs could become credible autonomous offensive tools with minimal scaffolding.
"This represents a significant milestone," said lead author Brian Singer of Carnegie Mellon. "It shows how close LLMs are to operational capability in cyber-offense—once a proper planning and execution pipeline is provided."
The authors emphasized ethical considerations in their work and have disclosed their findings to major LLM providers. They will open-source MHBench and Incalmo, but will restrict built-in exploit libraries to known and safe vulnerabilities.
Next Steps
Future directions include enhancing Incalmo's reasoning about network topology, supporting defenses in the loop, and using AI-generated agents to autonomously create new exploits when none exist in the current database.
As for the threat of model memorization, the researchers say today's LLMs likely don't have any meaningful prior exposure to complex multi-host attack graphs. But once MHBench becomes widely used, that could change.
"Whether for good or for ill, we're giving the community the tools to see just how far LLMs can go," the researchers wrote.
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].