What happened
This position paper argues that behavioral assurance methods (evaluations, red-teaming, system cards) are being asked to verify safety properties they cannot epistemically establish. AI governance frameworks enacted between 2019 and early 2026 require "reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability," but current assurance methodologies observe only model outputs and cannot verify latent representations or long-horizon agentic behaviors. The authors formalize this as the "audit gap," the divergence between the verification access that governance requires and the access that auditors can actually achieve, and introduce "fragile assurance" to describe cases where the evidential structure does not support the safety claims being asserted. Through analysis of 21 governance instruments (including EU AI Act Article 55, California SB-53, Singapore AI Verify, and the South Korea AI Basic Act), the paper identifies an incentive gradient in which geopolitical and industrial pressures reward surface-level behavioral proxies over deep structural verification. The authors propose bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes (linear probes, activation patching, before/after-training comparisons). Preprint, not peer-reviewed.
Why it matters
As frontier AI systems become more agentic and consequential, the gap between what governance demands and what auditors can verify creates systemic fragility. Regulators and boards relying on behavioral evaluations for high-stakes safety claims may be accepting assurance that cannot detect the properties it purports to measure — a structural risk that grows as models scale.
Action needed
Review your AI governance framework to distinguish between properties that can be verified through behavioral testing and those requiring mechanistic access. If your compliance strategy relies entirely on behavioral evaluation for high-consequence claims (e.g., absence of deception, bounded catastrophic capability), consider supplementing with mechanistic interpretability methods or adjusting claim scope to match evidential support.
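One way to start that review is a simple claim inventory. The sketch below is illustrative only and is not taken from the paper: the class name `SafetyClaim`, the function `needs_review`, the evidence identifiers, and the third example claim are all hypothetical. It assumes each safety claim in a compliance file can be tagged as high-consequence or not, and that its supporting evidence can be labeled with the behavioral or mechanistic classes named in the summary above.

```python
from dataclasses import dataclass

# Evidence classes mentioned in the summary; the identifiers are assumptions.
BEHAVIORAL = frozenset({"evaluation", "red_teaming", "system_card"})
MECHANISTIC = frozenset(
    {"linear_probe", "activation_patching", "training_comparison"}
)


@dataclass(frozen=True)
class SafetyClaim:
    """A safety claim and the evidence classes cited in support of it."""
    name: str
    high_consequence: bool
    evidence: frozenset


def needs_review(claims):
    """Return names of high-consequence claims with no mechanistic evidence."""
    return [
        c.name
        for c in claims
        if c.high_consequence and not (c.evidence & MECHANISTIC)
    ]


# Hypothetical inventory; the first two claim names come from the text above.
claims = [
    SafetyClaim("absence of deception", True,
                frozenset({"evaluation", "red_teaming"})),
    SafetyClaim("bounded catastrophic capability", True,
                frozenset({"evaluation", "linear_probe"})),
    SafetyClaim("refusal of disallowed requests", False,
                frozenset({"evaluation"})),
]

flagged = needs_review(claims)
print(flagged)  # ['absence of deception']
```

A flagged claim is a candidate for either supplementary mechanistic evidence or a narrower claim scope. The check says nothing about whether the mechanistic evidence on an unflagged claim is itself adequate; it only surfaces claims that rest entirely on behavioral evidence.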