Benchmarks

75.8% balanced accuracy, 14.6 ms.

Director-AI's production-validated metric is response-level hallucination scoring on LLM-AggreFact. Numbers below are committed measurements; the streaming contradiction halt is opt-in and evidence-bound, not a sole production gate.
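For readers new to the metric: balanced accuracy is the mean of per-class recall, so a scorer cannot look good by always predicting the majority class. The sketch below assumes that standard definition and uses toy labels that are illustrative only, not benchmark data; the function name is ours, not part of Director-AI.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy labels (assumption): 1 = supported, 0 = hallucinated.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

Plain accuracy rewards the eight easy positives; balanced accuracy exposes that half of the hallucinated cases were missed.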

75.8% balanced accuracy, LLM-AggreFact
14.6 ms NLI tier latency
17.9 ms Rust NLI per pair
<0.5 ms heuristic tier

Accuracy vs latency

Each tier buys accuracy with time.

The scorer climbs only as high as a claim needs. The heuristic tier settles the easy cases in under 0.5 ms; NLI handles what is still uncertain.

Scoring tier        | Balanced accuracy | Latency | Notes
NLI (cross-encoder) | 75.8%             | 14.6 ms | Default production tier on LLM-AggreFact
NLI (larger model)  | 77.4%             | ~40 ms  | Higher accuracy, higher cost
Embeddings          | ~73%              | ~15 ms  | Semantic support from retrieved evidence
Heuristic / rules   | ~55%              | <0.5 ms | Model-free, free, settles easy cases first
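The escalation idea, where cheap tiers answer first and costlier tiers run only when the result is still uncertain, can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not Director-AI's implementation: the tier functions, their return shape, and the toy rules inside them are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical tier signature: (claim, evidence) -> (score, is_confident).
Tier = Callable[[str, str], Tuple[float, bool]]


def heuristic_tier(claim: str, evidence: str) -> Tuple[float, bool]:
    # Toy rule (assumption): verbatim containment counts as confidently supported.
    if claim.lower() in evidence.lower():
        return 1.0, True
    return 0.5, False  # undecided, let a costlier tier look


def nli_tier_stub(claim: str, evidence: str) -> Tuple[float, bool]:
    # Stand-in for an NLI cross-encoder call. Word overlap is NOT real NLI;
    # it only keeps the sketch runnable without a model.
    claim_words = set(claim.lower().split())
    overlap = len(claim_words & set(evidence.lower().split()))
    return overlap / max(len(claim_words), 1), True


def score(claim: str, evidence: str, tiers: List[Tuple[str, Tier]]) -> Tuple[str, float]:
    """Run tiers cheapest-first and stop at the first confident answer."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, tier in tiers:
        value, confident = tier(claim, evidence)
        if confident:
            break
    # If no tier was confident, the last tier's score is returned.
    return name, value


tiers = [("heuristic", heuristic_tier), ("nli", nli_tier_stub)]
easy = score("Paris is in France", "We know that Paris is in France.", tiers)
hard = score("Paris is the capital", "Paris is in France.", tiers)
print(easy)  # ('heuristic', 1.0)
print(hard)  # ('nli', 0.5)
```

Ordering the list cheapest-first is what makes the easy case cost only the heuristic check.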

Rust acceleration

NLI scoring per claim pair.

The Rust-accelerated path runs the hot NLI loop roughly 4.5x to 11.6x faster per pair than the other measured backends (17.9 ms versus 80.1 to 207.3 ms).

Backend            | Latency per pair
Rust NLI           | 17.9 ms/pair
PyTorch (CPU)      | 80.1 ms/pair
ONNX               | 118.9 ms/pair
Transformers (CPU) | 207.3 ms/pair
Heuristic          | <0.5 ms/pair

Backend names above are illustrative of the measured tiers; exact per-package figures live in the repository's benchmarks/results/. Reproduce with the committed benchmark scripts.
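As a quick sanity check on the table, the speedup of the Rust path over each other NLI backend is just the ratio of per-pair latencies. The snippet uses only the figures listed above; the variable names are ours.

```python
# Per-pair latencies in milliseconds, copied from the table above.
latency_ms = {
    "Rust NLI": 17.9,
    "PyTorch (CPU)": 80.1,
    "ONNX": 118.9,
    "Transformers (CPU)": 207.3,
}

baseline = latency_ms["Rust NLI"]

# Speedup = other backend's latency divided by the Rust latency.
speedup = {
    name: round(ms / baseline, 1)
    for name, ms in latency_ms.items()
    if name != "Rust NLI"
}

for name, factor in speedup.items():
    print(f"{name}: {factor}x slower than Rust NLI")
```

On these figures the gap ranges from about 4.5x (PyTorch on CPU) to about 11.6x (Transformers on CPU).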

Honest boundary

What is production-validated, and what is not.

Validated

Response-level hallucination scoring — the 5-tier engine, SDK guard, FastAPI middleware, REST server, injection detection, and the agent/MCP preflight guard.

Opt-in

The streaming contradiction halt is evidence-bound and opt-in. Current local evidence is recorded in the repo, and the halt should not be used as the sole production gate.

Reproducible

Every number is a committed measurement. The benchmark scripts and result artefacts ship in the repository.