Definition
A formal scoring system that rates how dangerous a successful AI jailbreak is. It measures factors such as how far the safety bypass went, which harmful capabilities became accessible, how easily the attack can be repeated, and what real-world harm could result. The White House and Anthropic are actively developing the first government-industry version of such a benchmark.
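To make the idea of a severity score concrete, the four factors named above can be combined into a single number. The sketch below is purely illustrative: the factor names, the weights, the 0-10 scale, and the band thresholds are all hypothetical assumptions chosen for the example, not part of any published or proposed benchmark.

```python
from dataclasses import dataclass

# Hypothetical weights (illustrative only); they sum to 1.0.
WEIGHTS = {
    "bypass_depth": 0.25,       # how far the safety bypass went
    "capability_access": 0.30,  # what harmful capabilities became accessible
    "repeatability": 0.20,      # how easily the jailbreak can be repeated
    "potential_harm": 0.25,     # what real-world harm could result
}


@dataclass(frozen=True)
class JailbreakReport:
    """Each factor is rated from 0.0 (negligible) to 1.0 (worst case)."""
    bypass_depth: float
    capability_access: float
    repeatability: float
    potential_harm: float


def severity_score(report: JailbreakReport) -> float:
    """Return a weighted severity score on an assumed 0-10 scale."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = getattr(report, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return round(10.0 * total, 1)


def severity_band(score: float) -> str:
    """Map a score to a band label; the thresholds are assumptions."""
    if score >= 7.0:
        return "critical"
    if score >= 4.0:
        return "high"
    if score >= 2.0:
        return "moderate"
    return "low"
```

A weighted sum is only one possible design. A real benchmark might instead let a single worst-case factor dominate the result, or define qualitative tiers without any numeric score.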
Why it matters
Without an agreed severity scale, governments and companies have no shared language for deciding when an AI model is too dangerous to deploy or must be recalled. Such a benchmark is therefore the foundation for any credible AI model governance regime.