The Reproducibility Test

Almost every AI governance product on the market now describes itself as "policy enforcement." Most of them are not. Underneath the language, they are a model — a classifier, an LLM judge, a risk scorer — producing a number, and a threshold deciding what to do with it. That is a useful thing. It is not a control, and in a regulated setting the difference is the whole game.

A control is something you can reproduce, reconstruct, and defend to someone who was not in the room. A guardrail that returns "92% safe" cannot do any of those things, because the next run might say 88%, the run after that might say 94%, and none of the three can tell you, with certainty, why. "Probably safe" is a forecast. An examiner does not accept a forecast as evidence that a required control was in place.

So how do you tell the two apart before you sign? Run the reproducibility test. Five questions. Each one is answerable in a demo, not a sales deck — and each one a probabilistic system fails in a specific, observable way.

1. Same input, same verdict — every time?

Send the identical request twice, then twenty times. A deterministic system returns the byte-identical verdict on every run, because the decision is computed by rules over the inputs, not sampled from a distribution. A probabilistic guardrail drifts: temperature, model version, and load all move the number. If a vendor cannot demonstrate same-input / same-output on demand, everything downstream — the audit trail, the "policy," the certificate — is being built on a foundation that changes when you aren't looking.

Deterministic: identical verdict, identical reasoning, identical record.
Probabilistic: verdicts cluster near a threshold and cross it unpredictably.

2. Does the decision happen before the action, or after?

This is the question that separates a brake from a dashboard light. Many "guardrails" run after the model has already produced its output — they score the result and hope to catch it before it ships. A control runs before the consequential action is dispatched, so a blocked action never happens at all. Ask exactly where in the request lifecycle the verdict is computed. "Pre-execution" and "post-hoc filtering" are not two flavors of the same thing; only one of them can actually stop harm.

3. Does the record name the rule — or just the score?

Pull the audit record for a single blocked decision and read it. A deterministic record cites the specific rule or policy clause that fired, the inputs it evaluated, and the disposition that followed. A probabilistic record contains a number and, if you are lucky, a feature attribution. When an examiner asks "why was this action blocked," the first record answers the question; the second restates it. The test is simple: can a colleague who has never seen the system read the record and reconstruct the decision without asking the vendor?

4. Can a third party replay the decision without you?

This is the one most vendors quietly fail. A genuine governance decision should be independently verifiable: a third party — your auditor, your regulator, opposing counsel — should be able to take the recorded inputs and the published policy, re-run the decision, and arrive at the same verdict, without trusting the vendor's word and without the vendor's involvement. Cryptographically signed evidence makes this stronger still: the signature proves the record has not been altered since it was issued, and verification needs only a public key. If "verification" means logging back into the vendor's dashboard, it is not verification. It is a screenshot.

Reproducibility and replay are not the same property. Reproducibility means the system returns the same verdict for the same input. Replay means someone other than the system can confirm that verdict after the fact. A vendor can have the first without the second — and only the second survives an adversarial audit.

5. Does the refusal hold under pressure?

Finally, test the refusal itself. Take a request the system blocks, then rephrase it, escalate it, wrap it in a plausible business justification, and send it again. A deterministic boundary holds because it is enforced by logic that does not negotiate. A probabilistic boundary can often be talked across — the same prompt-engineering that jailbreaks a model also erodes a model-based guardrail, because they are the same kind of artifact. The hardest, least-overridable controls are the ones computed by a small, auditable rule set with no path to "convince it otherwise."

Why we publish the test we'd be measured against

It would be easy to keep this checklist private. We publish it because deterministic governance only means something if buyers can verify the claim — and a standard you can apply to every vendor, ours included, is more useful than a standard that only flatters one. EVE's architecture is built to pass all five: a pre-execution decision layer that returns the same verdict for the same inputs, records that cite the rule that fired, and decision evidence you can verify against a public key without us in the loop.

If you want to see it run rather than read about it, the regulatory mappings show which obligations each capability satisfies, and a governance assessment walks the five questions against your own highest-risk workflows.

1. Same input, same verdict — every time?

2. Does the decision happen before the action, or after?

3. Does the record name the rule — or just the score?

4. Can a third party replay the decision without you?

5. Does the refusal hold under pressure?

Why we publish the test we'd be measured against

Related

Book a Governance Assessment

Principles, made enforceable.