WARD Guard Model Defends Web Agents Against Prompt Injection with Near-Perfect Recall

Technical description

Researchers introduced WARD (Web Agent Robust Defense against Prompt Injection), a guard model for securing web agents against prompt injection attacks embedded in HTML content or visual interfaces. WARD is trained on WARD-Base (177K samples from 719 high-traffic URLs) and WARD-PIG (dedicated dataset for guard-targeted attacks). The system achieves nearly perfect recall on out-of-distribution benchmarks, maintains low false positive rates, and runs efficiently in parallel with the agent without added latency.

Attack vector

Web agents encounter adversarial prompt injections embedded in web pages they visit—through HTML comments, invisible CSS, or LLM-generated semantic prose within user reviews, forum posts, ads, or embedded widgets. Existing guard models suffer from limited generalization to unseen domains, high false positives, deployment latency, and vulnerability to adversarial attacks that evolve or target the guard directly.

Affected systems

Web agents that autonomously browse websites and interact with HTML content, including browser-based AI assistants, autonomous shopping agents, and research agents navigating open web environments. The defense applies to systems exposed to untrusted third-party content during task execution.

Mitigation

Deploy WARD as a parallel guard model inspecting webpage states (HTML and screenshots) before agent execution. WARD's adaptive adversarial training framework (A3T) enables iterative strengthening through memory-based attacker and guard co-evolution. The system's low latency design allows real-time protection without degrading agent performance.

WARD Guard Model Defends Web Agents Against Prompt Injection with Near-Perfect Recall

Technical description

Attack vector

Affected systems

Mitigation

Sources