Technical description
Trend Micro researchers disclosed 'Sockpuppeting', a jailbreak technique that bypasses safety guardrails on 11 major LLMs using a single line of code exploiting the API assistant prefill feature. Successfully extracted functional malware code and confidential system prompts.
Attack vector
Injection of a fake acceptance into the assistant-role message via standard API prefill feature, exploiting the model's self-consistency tendency to continue prohibited output. Requires only API access supporting assistant prefill—no model weights, optimisation, or specialised tooling.
Affected systems
GPT-4o, GPT-4o-mini, Claude 4 Sonnet, Gemini 2.5 Flash (most susceptible at 15.7% ASR), and 7 other major LLMs. Three models blocked at API layer.
Mitigation
Implement message-ordering validation that blocks assistant-role messages at the API layer. Apply output filtering for known attack patterns. Monitor API usage for anomalous prefill patterns.