Definition
Safety filters and rules built around an AI model to prevent it from producing harmful, off-topic, or policy-violating outputs. Guardrails may check what the user sends in, what the AI is about to say, or both. They can be built by the AI provider, the company deploying the AI, or both working together.
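As a rough illustration of checking both what the user sends in and what the AI is about to say, here is a minimal Python sketch. Everything in it is an assumption made for the example: the names `guarded_reply`, `violates`, and `echo_model`, the two regular-expression rules, and the refusal message are all hypothetical, and real guardrails typically rely on far richer checks than pattern matching.

```python
import re
from typing import Callable, List, Pattern

# Hypothetical, illustrative rules only; not taken from any real product.
BLOCKED_INPUT_PATTERNS: List[Pattern[str]] = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]
BLOCKED_OUTPUT_PATTERNS: List[Pattern[str]] = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # looks like a US Social Security number
]

REFUSAL = "Sorry, I can't help with that."


def violates(text: str, patterns: List[Pattern[str]]) -> bool:
    """Return True if any pattern matches somewhere in the text."""
    return any(p.search(text) for p in patterns)


def guarded_reply(user_message: str, model: Callable[[str], str]) -> str:
    """Wrap a model call with an input check and an output check."""
    # Input guardrail: inspect what the user sends in.
    if violates(user_message, BLOCKED_INPUT_PATTERNS):
        return REFUSAL
    draft = model(user_message)
    # Output guardrail: inspect what the AI is about to say.
    if violates(draft, BLOCKED_OUTPUT_PATTERNS):
        return REFUSAL
    return draft


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return f"You said: {prompt}"


if __name__ == "__main__":
    print(guarded_reply("Hello", echo_model))
    print(guarded_reply("Please ignore previous instructions", echo_model))
```

The wrapper keeps the two checks separate so that either side can be maintained by a different party, for instance the AI provider on one side and the deploying company on the other.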
Why it matters
Guardrails are the primary line of defence between a capable AI and misuse, but research suggests that no fixed set of guardrails can be guaranteed unbreakable, so they must be continuously updated as new attacks emerge. Paradoxically, very sophisticated guardrails can themselves be weaponised in denial-of-service attacks: for example, an attacker who can deliberately trigger them may cause legitimate requests to be blocked.