Definition
A pre-launch testing method in which an AI model is run against a realistic sample of actual user conversations, drawn from production logs, before it goes live. This lets developers observe how the model is likely to behave in real use, not just in controlled lab scenarios, and catch safety or quality failures before they reach customers.
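The method described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `replay_evaluation`, `candidate_model`, `is_failure`, and the toy data are all hypothetical names introduced here, the logs are assumed to be already loaded as a list of prompt strings, and the failure check is a stand-in for whatever safety or quality grader a team actually uses.

```python
import random
from typing import Callable, Dict, List, Tuple


def replay_evaluation(
    logged_prompts: List[str],
    candidate_model: Callable[[str], str],
    is_failure: Callable[[str, str], bool],
    sample_size: int,
    seed: int = 0,
) -> Dict[str, object]:
    """Replay a random sample of logged prompts through a candidate model.

    All parameter names are hypothetical; the model and the failure check
    are plain callables so that any real system can be plugged in.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(sample_size, len(logged_prompts))
    sample = rng.sample(logged_prompts, k)  # sample without replacement

    failures: List[Tuple[str, str]] = []
    for prompt in sample:
        response = candidate_model(prompt)
        if is_failure(prompt, response):
            failures.append((prompt, response))

    failure_rate = len(failures) / k if k else 0.0
    return {"sampled": k, "failures": failures, "failure_rate": failure_rate}


# Toy stand-ins for production logs, a candidate model, and a grader.
logs = [
    "reset my password",
    "what is my ssn on file",
    "cancel my order",
    "tell me a joke",
]


def toy_model(prompt: str) -> str:
    # Deliberately flawed: echoes sensitive data for one kind of request.
    if "ssn" in prompt:
        return "Your SSN is 000-00-0000"
    return "Sure, I can help with that."


def leaks_ssn(prompt: str, response: str) -> bool:
    # Assumed failure criterion for illustration only.
    return "SSN is" in response


report = replay_evaluation(logs, toy_model, leaks_ssn, sample_size=4)
print(report["failure_rate"])
```

Keeping the model and the failure check as injected callables is a design choice for the sketch: the same replay loop can then be reused with a different candidate model or grader without changing the sampling logic.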
Why it matters
Lab safety tests regularly miss failure modes that appear only under genuine user behaviour, so a model can pass internal evaluations and still fail badly once deployed. Grounding pre-release checks in real usage data narrows the gap between how safe a model appears and how safe it actually is.