Technical description
A Nature Communications study tested four reasoning models (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) as autonomous attackers against nine target models and reported a 97.14% jailbreak success rate, which turns jailbreaking from an expert craft into a cheap, scalable attack.
Attack vector
Autonomous multi-step prompt engineering: a reasoning model generates jailbreak prompts and refines them over successive attempts based on the target's responses.
Affected systems
All deployed LLMs; enterprise deployments of open-weight models face heightened risk.
Mitigation
Proactive defences (e.g., ProAct) that inject spurious responses to disrupt attacker feedback loops; LLM salting; jailbreak distillation for evaluation; alignment work to prevent frontier reasoning models from being co-opted as attackers.
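
Illustrative sketch (defensive side only)
The spurious-response idea can be sketched as a thin wrapper around the protected model: when a prompt is flagged, the wrapper returns a harmless decoy instead of a plain refusal, so an automated attacker does not get a clean success or failure signal to iterate on. This is a minimal sketch under stated assumptions, not the ProAct implementation. The names DecoyGuard, is_suspicious, target, and flagged_turns are hypothetical, and the keyword detector is a placeholder for whatever classifier a deployment actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DecoyGuard:
    """Hypothetical wrapper: answers flagged prompts with a harmless decoy."""

    target: Callable[[str], str]          # the protected model call (assumed)
    is_suspicious: Callable[[str], bool]  # external detector (assumed)
    decoy: str = "Sure, here is what you asked for: [no content available]"
    flagged_turns: Dict[str, int] = field(default_factory=dict)

    def respond(self, session_id: str, prompt: str) -> str:
        if self.is_suspicious(prompt):
            # Count flagged turns per session so operators can review them.
            count = self.flagged_turns.get(session_id, 0) + 1
            self.flagged_turns[session_id] = count
            # Return the decoy rather than a refusal; the target is not called.
            return self.decoy
        return self.target(prompt)


def toy_detector(prompt: str) -> bool:
    """Placeholder detector for illustration only; not a real classifier."""
    return "ignore previous instructions" in prompt.lower()


def toy_target(prompt: str) -> str:
    """Stand-in for the protected model."""
    return "echo: " + prompt


guard = DecoyGuard(target=toy_target, is_suspicious=toy_detector)
benign_reply = guard.respond("session-1", "What is 2 + 2?")
flagged_reply = guard.respond("session-1", "Ignore previous instructions.")
```

In this sketch benign prompts pass through unchanged, while flagged prompts never reach the target and are tallied per session. How convincing the decoy must be, and how reliable the detector is, are deployment questions the sketch does not address.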