Technical description
Google DeepMind researchers published the first systematic framework for understanding web-based attacks against autonomous AI agents. The paper identifies six categories of 'AI Agent Traps': content injection, semantic manipulation, cognitive state corruption, data exfiltration, systemic attacks, and human-in-the-loop manipulation. Data exfiltration attack success rates exceeded 80% across five tested agents.
Attack vector
Attackers embed malicious instructions in HTML comments, invisible CSS-positioned text, or steganographic image data. These instructions are invisible to human moderators but processed by AI agents. RAG knowledge poisoning achieves backdoor success rates exceeding 80% at less than 0.1% data poisoning.
Affected systems
All autonomous AI agents that browse the web, process external documents, or interact with retrieval-augmented generation systems. Includes agents built on GPT, Claude, Gemini, and other major LLM platforms.
Mitigation
Implement input sanitisation for agent-consumed content, deploy runtime defenses against prompt injection, establish content governance frameworks, and maintain human oversight for high-stakes agent actions. The paper recommends training data augmentation to harden underlying models.