Major AI security flaws exposed in red teaming competition

Chat screenshot with prompt injection that discloses Nova Wilson's medical data (height, weight, diagnoses) without authorization.

A groundbreaking red teaming study has uncovered alarming security weaknesses in today's most advanced AI agents. Between March 8 and April 6, 2025, nearly 2,000 participants launched 1.8 million attacks against 22 AI models from leading labs including OpenAI, Anthropic, and Google Deepmind.

Universal Vulnerabilities Exposed

The competition, organized by Gray Swan AI and hosted by the UK AI Security Institute, revealed that:

100% of tested models failed at least one security test
Attackers achieved 12.7% average success rate
Over 62,000 successful attacks resulted in policy violations

Stacked bar chart showing attack success rates from 20-60% to nearly 100%

Attack Methods and Results

Researchers targeted four key behavior categories:

Confidentiality breaches
Conflicting objectives
Prohibited information
Prohibited actions

Indirect prompt injections proved particularly effective, working 27.1% of the time compared to just 5.7% for direct attacks. These attacks hide malicious instructions in websites, PDFs, or emails.

Bar chart showing attack success rates for AI models

Model Performance

While Anthropic's Claude models demonstrated the most robust security, even they weren't immune:

Claude 3.5 Haiku showed surprising resilience
Claude 3.7 Sonnet (tested before Claude 4's release) still had vulnerabilities
Attack techniques often transferred between models with minimal modification

Heat map of transfer attack success rates between models

Common Attack Strategies

Successful methods included:

System prompt overrides using tags like '<system>'
Simulated internal reasoning ('faux reasoning')
Fake session resets
Parallel universe commands

Example attack prompts showing universal vulnerabilities

Creating a New Benchmark

The competition results formed the basis for the Agent Red Teaming (ART) benchmark, a curated set of 4,700 high-quality attack prompts. This will be maintained as a private leaderboard updated through future competitions.

Industry Implications

The findings come as:

OpenAI rolls out agent functionality in ChatGPT
Google focuses on AI agent capabilities
Even OpenAI's CEO warns against using AI agents for critical tasks

The study authors conclude: "These findings underscore fundamental weaknesses in existing defenses and highlight an urgent and realistic risk that requires immediate attention."

For more technical details, see the full research paper.

Major AI security flaws exposed in red teaming competition

Universal Vulnerabilities Exposed

Attack Methods and Results

Model Performance

Common Attack Strategies

Creating a New Benchmark

Industry Implications

Related News

AI Agents Fuel Identity Debt Risks Across APAC

Dynamic Context Firewall Enhances AI Security for MCP

About the Author

Dr. Sarah Chen

Expertise

Universal Vulnerabilities Exposed

Attack Methods and Results

Model Performance

Common Attack Strategies

Creating a New Benchmark

Industry Implications

Related News

AI Agents Fuel Identity Debt Risks Across APAC

Dynamic Context Firewall Enhances AI Security for MCP

About the Author

Dr. Sarah Chen

Expertise

Agent Newsletter

Get Agentic Newsletter Today