Attack  ·  Glossary

Chain-of-thought hijacking

An attack against AI models that show their reasoning step-by-step (so-called 'reasoning' or 'thinking' models). An attacker injects fake reasoning text that mimics how the model thinks internally, tricking it into bypassing its own safety rules — achieving near-100% success rates in research tests. The more a model 'thinks out loud,' the more surface area an attacker has to manipulate.
Safety reviews often assume that more-capable, more-deliberate AI models are safer — this attack inverts that assumption, meaning your most powerful AI assistants may be your most exploitable ones. Boards should ask vendors whether their reasoning models have been specifically tested against this attack class.
References
NeuralTrust: Chain-of-Thought Hijacking Research
Track this in the live feed See how this plays out in real AI security and governance developments.
Open the feed →