Definition
An attack against AI models that show their reasoning step-by-step (so-called 'reasoning' or 'thinking' models). An attacker injects fake reasoning text that mimics how the model thinks internally, tricking it into bypassing its own safety rules — achieving near-100% success rates in research tests. The more a model 'thinks out loud,' the more surface area an attacker has to manipulate.
Why it matters
Safety reviews often assume that more-capable, more-deliberate AI models are safer — this attack inverts that assumption, meaning your most powerful AI assistants may be your most exploitable ones. Boards should ask vendors whether their reasoning models have been specifically tested against this attack class.