Large Reasoning Models Achieve 97% Jailbreak Success as Autonomous Attackers

Technical description

A Nature Communications study tested four reasoning models (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) as autonomous attackers against nine targets, finding a 97.14% jailbreak success rate — converting jailbreaking from an expert craft into a cheap, scalable attack.

Attack vector

Autonomous multi-step prompt engineering where reasoning models generate and iterate jailbreak prompts based on target responses.

Affected systems

All deployed LLMs; enterprise deployments of open-weight models face heightened risk.

Mitigation

Proactive defences (e.g., ProAct) that inject spurious responses to disrupt attacker feedback loops; LLM salting; jailbreak distillation for evaluation; alignment work to prevent frontier reasoning models from being co-opted as attackers.

Large Reasoning Models Achieve 97% Jailbreak Success as Autonomous Attackers

Technical description

Attack vector

Affected systems

Mitigation

Sources