Vulnerability  ·  2026-05-31

MemPoison — Stealthy Trojan Attack Injects Persistent Backdoors into LLM Agent Long-Term Memory via Ordinary Dialogue, Bypassing Selective Memory Defenses

VulnerabilityHigh impactGlobal
Researchers from Tsinghua University, PLA Information Engineering University, and affiliated institutions published MemPoison (arXiv:2605.29960, 2026-05-28), a novel memory poisoning attack against LLM agents with long-term memory. Unlike prior attacks that assume direct database write access, MemPoison operates entirely through ordinary black-box dialogue interactions. The attack uses three components: (1) a semantic relational bridge that combines trigger and malicious payload into a coherent sentence, ensuring both survive the agent's selective memory extraction; (2) entity masquerading that camouflages triggers as named entities to resist the agent's memory rewriting stage; (3) joint embedding optimization that clusters trigger-injected text near benign embeddings for stealth while maintaining separation for reliable retrieval. Evaluated across multiple agent domains and memory architectures, MemPoison achieves attack success rates up to 0.95 — substantially outperforming prior methods — while existing defenses (including detection-based and isolation-based approaches) fail to reliably mitigate it.
Adversary interacts with a memory-augmented LLM agent through its normal user interface. Crafted dialogue messages containing the trigger-payload construct pass through the agent's memory ingestion pipeline, are selectively extracted (surviving the filtering step), persist in long-term memory storage, and are subsequently retrieved on matching future queries — causing the agent to execute the attacker-specified behavior when the trigger condition is met. No privileged access required; the attack is repeatable across sessions.
LLM agents with long-term memory mechanisms (MemGPT-style systems, RAG-augmented agents with persistent episodic memory, customer service agents with session history, coding agents with project memory). Production deployments of OpenClaw, Codex, Claude Code, and any agent framework storing user-interaction history are structurally exposed if memory filtering can be bypassed.
No direct patch; researchers evaluated multiple defense strategies and found fundamental limitations in all. Recommended interim controls: (1) Treat agent long-term memory stores as adversarial inputs rather than trusted state — apply anomaly detection on extracted memory entries, particularly looking for unusually specific named-entity associations. (2) Limit memory persistence for untrusted or public-facing agents. (3) Require human review before memory entries that contain task-modifying instructions are committed to persistent storage. (4) Evaluate memory-augmented agents under MemPoison-class adversarial inputs before production deployment.
Sources
arXiv:2605.29960 — Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction
See this in the live feed Explore related AI security and governance findings — updated every morning.
Open the feed →