Benchmarks

75.8% balanced accuracy, 14.6 ms.

Director-AI's production-validated metric is response-level hallucination scoring on LLM-AggreFact. Numbers below are committed measurements; the streaming contradiction halt is opt-in and evidence-bound, not a sole production gate.
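For readers unfamiliar with the metric: balanced accuracy is the mean of per-class recall, so a scorer cannot look good just by favouring the majority class. The sketch below is illustrative only. The label encoding (1 = supported, 0 = hallucinated) and the toy labels are assumptions made for this example, not data from LLM-AggreFact or from Director-AI's benchmark scripts.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall for binary labels.

    Assumed encoding for this sketch: 1 = supported, 0 = hallucinated.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Correct predictions within each class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # Class sizes.
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (tp / pos + tn / neg)


# Toy, made-up labels: 8 supported responses and 2 hallucinated ones.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A scorer that always answers "supported" is right 8 times out of 10,
# but recalls none of the hallucinated class.
always_supported = [1] * 10
print(balanced_accuracy(y_true, always_supported))  # 0.5
```

Plain accuracy would report 80% for that degenerate scorer; balanced accuracy reports 0.5, which is why it is the more honest headline number on imbalanced data.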

75.8% balanced accuracy, LLM-AggreFact

14.6 ms NLI tier latency

17.9 ms Rust NLI per pair

<0.5 ms heuristic tier

Accuracy vs latency

Each tier buys accuracy with time.

The scorer climbs only as high as a claim needs. The heuristic tier settles the easy cases in under half a millisecond; NLI handles what is still uncertain.

Scoring tier	Balanced accuracy	Latency	Notes
NLI (cross-encoder)	75.8%	14.6 ms	Default production tier on LLM-AggreFact
NLI (larger model)	77.4%	~40 ms	Higher accuracy, higher cost
Embeddings	~73%	~15 ms	Semantic support from retrieved evidence
Heuristic / rules	~55%	<0.5 ms	Model-free, no inference cost; settles easy cases first
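The escalation idea behind the table can be sketched as a cascade: try the cheapest tier first and stop at the first tier that returns a decision. This is a minimal illustration under stated assumptions, not Director-AI's engine. The names `score_claim`, `heuristic_tier`, `stub_nli_tier`, and `Verdict` are hypothetical, the containment rule is invented for the example, and the "NLI" tier is a token-overlap stand-in rather than a real cross-encoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A tier returns a support score in [0, 1], or None when it abstains.
Tier = Callable[[str, str], Optional[float]]


@dataclass
class Verdict:
    score: float
    tier: str  # name of the tier that settled the claim


def heuristic_tier(claim: str, evidence: str) -> Optional[float]:
    """Hypothetical rule: verbatim containment settles the claim, else abstain."""
    if claim.strip().lower() in evidence.lower():
        return 1.0
    return None


def stub_nli_tier(claim: str, evidence: str) -> Optional[float]:
    """Stand-in for a cross-encoder: fraction of claim tokens found in evidence."""
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(evidence.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)


def score_claim(claim: str, evidence: str, tiers: List[Tuple[str, Tier]]) -> Verdict:
    """Walk the tiers from cheapest to most expensive; stop at the first decision."""
    for name, tier in tiers:
        score = tier(claim, evidence)
        if score is not None:
            return Verdict(score=score, tier=name)
    raise RuntimeError("no tier produced a score")


TIERS: List[Tuple[str, Tier]] = [
    ("heuristic", heuristic_tier),  # cheapest, may abstain
    ("nli", stub_nli_tier),         # more expensive, always answers
]

evidence = "The Eiffel Tower is in Paris and opened in 1889."
easy = score_claim("the eiffel tower is in paris", evidence, TIERS)
hard = score_claim("The tower opened in Paris", evidence, TIERS)
print(easy.tier, hard.tier)
```

The first claim appears verbatim in the evidence, so the cheap tier settles it; the second is a paraphrase, so the cascade falls through to the model tier. Ordering the list by cost is what keeps average latency close to the cheap tier when most claims are easy.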

Rust acceleration

NLI scoring per claim pair.

The Rust-accelerated path runs the hot NLI loop about 4.5x faster than PyTorch on CPU, 6.6x faster than ONNX, and 11.6x faster than Transformers on CPU.

Backend	Latency per pair
Rust NLI	17.9 ms/pair
PyTorch (CPU)	80.1 ms/pair
ONNX	118.9 ms/pair
Transformers (CPU)	207.3 ms/pair
Heuristic	<0.5 ms/pair

Backend names above are illustrative of the measured tiers; exact per-package figures live in the repository's benchmarks/results/. Reproduce with the committed benchmark scripts.
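For orientation, a per-pair latency figure of this kind can be measured along the following lines. This is a generic timing sketch, not the repository's benchmark script: `time_per_pair`, `dummy_scorer`, and the warmup and repeat counts are all assumptions introduced for illustration, and the dummy scorer stands in for a real backend.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (premise, hypothesis)


def time_per_pair(
    score: Callable[[str, str], float],
    pairs: Sequence[Pair],
    warmup: int = 3,
    repeats: int = 20,
) -> Dict[str, float]:
    """Return median/min/max wall-clock milliseconds per pair over `repeats` passes."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    # Warm-up passes so one-off costs (caches, lazy init) do not skew the samples.
    for _ in range(warmup):
        for premise, hypothesis in pairs:
            score(premise, hypothesis)
    samples_ms: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        for premise, hypothesis in pairs:
            score(premise, hypothesis)
        elapsed_s = time.perf_counter() - start
        samples_ms.append(elapsed_s * 1000.0 / len(pairs))
    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


def dummy_scorer(premise: str, hypothesis: str) -> float:
    """Placeholder backend: token overlap, used only to exercise the harness."""
    p, h = set(premise.split()), set(hypothesis.split())
    return len(p & h) / max(len(h), 1)


pairs = [("the sky is blue today", "the sky is blue")] * 50
result = time_per_pair(dummy_scorer, pairs)
print(sorted(result))
```

Reporting the median rather than the mean keeps a single slow pass from dominating the figure. Absolute numbers depend on hardware, batch size, and sequence length, so only results produced by the committed scripts are comparable with the table above.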

Honest boundary

What is production-validated, and what is not.

Validated

Response-level hallucination scoring — the 5-tier engine, SDK guard, FastAPI middleware, REST server, injection detection, and the agent/MCP preflight guard.

Opt-in

The streaming contradiction halt is evidence-bound and opt-in. Current local evidence is recorded in the repo, and the halt should not be used as the sole production gate.

Reproducible

Every number is a committed measurement. The benchmark scripts and result artefacts ship in the repository.