Researchers Propose New Method to Strengthen AI Security Testing

A team of researchers from Fuel iX Applied Research and collaborating institutions has introduced a novel method to evaluate and improve adversarial testing of large language models (LLMs), aiming to help developers uncover potential vulnerabilities more effectively.

Their work centers on a refinement to Automated Red Teaming (ART), a technique that uses AI systems to generate attacks against other AI systems. The group proposes measuring the Attack Success Rate (ASR) not just across a set of attacks, but across repeated trials of individual attacks—an approach they say captures more detail about how reliably an attack might succeed.

Single attempts often don't reflect the real-world reliability of an attack, the researchers argue, because LLM responses vary due to inherent randomness; repeating an attack reveals whether it is consistently effective. The study also contends that assessing attacks over multiple runs against a randomly seeded target—rather than relying on a single success or failure—yields a more nuanced understanding of the system's vulnerability. The resulting distribution of ASR scores, the researchers assert, provides valuable insight into which prompts are more likely to reveal security weaknesses.
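To make the idea concrete, here is a minimal sketch, not taken from the paper, of how a per-attack ASR could be estimated over repeated trials. The run_attack and judge_success functions and the trial count are hypothetical stand-ins for a call to a randomly seeded target model and a classifier that labels each response.

    # A minimal sketch (not the authors' code): estimating Attack Success Rate
    # (ASR) per attack over repeated trials instead of a single pass/fail check.
    # `run_attack` and `judge_success` are hypothetical stand-ins for a call to
    # a randomly seeded target LLM and a judge that labels the response.
    import random

    def run_attack(prompt: str) -> str:
        """Hypothetical call to the target model; responses vary run to run."""
        return random.choice(["refusal", "policy-violating output"])

    def judge_success(response: str) -> bool:
        """Hypothetical judge deciding whether the attack succeeded."""
        return response == "policy-violating output"

    def attack_success_rate(prompt: str, trials: int = 20) -> float:
        """Fraction of repeated trials in which the attack succeeds."""
        successes = sum(judge_success(run_attack(prompt)) for _ in range(trials))
        return successes / trials

    attacks = ["attack prompt A", "attack prompt B", "attack prompt C"]
    # A distribution of per-attack ASRs, rather than one binary outcome per
    # attack, shows which prompts are reliably effective.
    asr_by_attack = {a: attack_success_rate(a) for a in attacks}
    print(asr_by_attack)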

Prompting as an Optimization Tool
The team also employed a newer approach to improving attack generators called Optimization by Prompting (OPRO), which leverages an LLM's reasoning abilities to refine its attack strategies. In the method, the model is shown pairs of semantically similar attacks with significantly different ASRs and asked to learn from the more effective one.

This technique, referred to as ASR-delta pair mining, relies on identifying subtle linguistic changes that correlate with higher success rates. The optimizer then generates improved prompts designed to produce stronger attack attempts in future iterations. The idea is to feed the model examples of what works and what doesn't. The model learns to generate prompts that are more likely to succeed, not just once, but consistently.
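The paper itself does not include code, but the following rough sketch illustrates what ASR-delta pair mining might look like in practice. The similarity measure, thresholds, and prompt wording here are illustrative assumptions, not the authors' implementation.

    # An illustrative sketch of ASR-delta pair mining: find pairs of similar
    # attack prompts with very different measured ASRs, then present them to an
    # optimizer LLM in an OPRO-style prompt. The similarity measure, thresholds,
    # and prompt wording are assumptions, not the paper's actual settings.
    from itertools import combinations

    def similarity(a: str, b: str) -> float:
        """Crude word-overlap stand-in for a semantic similarity score."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    def mine_asr_delta_pairs(asr_by_attack: dict[str, float],
                             min_sim: float = 0.8,
                             min_delta: float = 0.3) -> list[tuple[str, str]]:
        """Keep pairs that are close in wording but far apart in success rate."""
        pairs = []
        for a, b in combinations(asr_by_attack, 2):
            close = similarity(a, b) >= min_sim
            far_apart = abs(asr_by_attack[a] - asr_by_attack[b]) >= min_delta
            if close and far_apart:
                weak, strong = sorted((a, b), key=asr_by_attack.get)
                pairs.append((weak, strong))
        return pairs

    def build_optimizer_prompt(pairs: list[tuple[str, str]]) -> str:
        """Turn mined pairs into an instruction for the attack-generating LLM."""
        examples = "\n\n".join(
            f"Less effective attack:\n{weak}\n\nMore effective attack:\n{strong}"
            for weak, strong in pairs
        )
        return (
            "Each pair below shows two similar red-team prompts; the second "
            "succeeded far more often against the target model.\n\n"
            f"{examples}\n\n"
            "Propose new prompts that are likely to succeed consistently."
        )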

Comparing Evaluation Strategies
To validate their method, the researchers compared it to a baseline version that used one-time success rates. Both methods improved the attack generator's effectiveness, but the ASR-delta version performed better, according to the paper.

The findings suggest that distribution-based analysis of attack success offers a stronger foundation for assessing generator quality, especially in scenarios where attackers are limited by time or computational resources and need to prioritize attacks that are consistently effective.

Balancing Breadth and Depth
One key insight from the research involves what the authors call "attack discoverability"—how likely a particular attack is to succeed across different runs.

Because AI responses are probabilistic, an attack that works once may not work again. Focusing only on binary success risks misjudging both the strength of the attack and the robustness of the system.

"Attackers face a trade-off," the paper notes. "They must balance between exploring many different prompts and exploiting prompts that have a higher likelihood of repeated success."

By capturing a full distribution of outcomes, the ASR-based approach makes this trade-off more visible.

Limitations and Future Directions
While the results are promising, the researchers acknowledge several limitations. The study was conducted using OpenAI's GPT-4o model, so the findings may not generalize to other LLM architectures. Additionally, the method requires substantial computational resources to test each attack multiple times, which could present challenges for researchers with limited access to AI systems.

The authors suggest that future work could explore how well their optimization strategy transfers to other LLM-based systems and whether defense mechanisms can be developed that are robust to this type of optimized prompting.

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].
