Research release
We audited our own AI for hallucinations. Here is the method, the data, and the script that checks our math.
Every review on this site is researched by an AI pipeline. The fair question: how do we know it is not making things up? We measured it, and we published everything needed to check us.
What we found
The audit covers 25 models and the 30 hardest companies we cover, across 454 audited runs, with every claim checked by deterministic code against frozen copies of real source pages. No model produced zero hallucinations, including the most expensive on the market. What keeps hallucinations out of what you read is the architecture: a gate blocks any claim that has not been verified. That gate caught every hallucination the winning pipeline produced, and none reached a published page. Production adds one more layer of testing on top of everything the paper describes.
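To make the gate idea concrete, here is a minimal sketch in Python. It is not the pipeline's code: the `Claim` type, the `is_supported` and `gate` functions, and the verbatim-quote check are all assumptions made for illustration. The whitepaper describes the actual detector. The sketch only shows the shape of the idea, which is that a deterministic check against a frozen page decides what may be published.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record: a statement plus the passage cited for it."""
    text: str   # the sentence the model wants to publish
    quote: str  # the source passage the model cites as support


def _normalize(s: str) -> str:
    # Collapse whitespace and case so line breaks do not cause false misses.
    return " ".join(s.split()).lower()


def is_supported(claim: Claim, frozen_page: str) -> bool:
    """Stand-in check (an assumption): the cited quote must appear verbatim
    in the frozen copy of the source page. Empty quotes never pass."""
    quote = _normalize(claim.quote)
    return bool(quote) and quote in _normalize(frozen_page)


def gate(claims: List[Claim], frozen_page: str) -> Tuple[List[Claim], List[Claim]]:
    """Split claims into (published, blocked). Nothing unverified is published."""
    published, blocked = [], []
    for claim in claims:
        (published if is_supported(claim, frozen_page) else blocked).append(claim)
    return published, blocked


# Toy data, invented for illustration.
page = "Acme Corp was founded in 2011 and\nemploys 40 people."
claims = [
    Claim("Acme was founded in 2011.", "founded in 2011"),
    Claim("Acme has 500 employees.", "employs 500 people"),
    Claim("Acme is profitable.", ""),
]
published, blocked = gate(claims, page)
print(len(published), "published,", len(blocked), "blocked")
```

The design point is that the gate fails closed: a claim with no verifiable support is blocked by default rather than published by default.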
The whitepaper explains the method so any team can run the same audit on its own pipeline. The dataset ships with it, licensed CC BY 4.0: share and republish with attribution.
Replicate it in three steps
- Read the data dictionary: every file and every column, defined. About five minutes.
- Run verify_results.py from the same directory as LEDGER.json (Python 3.9 or later, standard library only, no network access needed). It recomputes every headline number in the paper and prints PASS or FAIL for each one.
- Audit any single verdict in defects-all-cells.json, claim by claim. To grade your own pipeline on the same task, follow the 10-question retrieval test.
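The recompute-and-compare pattern behind the second step can be sketched as follows. This is not the contents of verify_results.py. The field names (`claims`, `hallucinations`, `published_hallucinations`), the helper functions, and every number below are hypothetical toy values; DATA-DICTIONARY.md defines the real columns.

```python
# Toy rows, invented for illustration. Field names are hypothetical;
# DATA-DICTIONARY.md defines the real LEDGER.json columns.
TOY_ROWS = [
    {"claims": 40, "hallucinations": 2, "published_hallucinations": 0},
    {"claims": 60, "hallucinations": 3, "published_hallucinations": 0},
    {"claims": 50, "hallucinations": 0, "published_hallucinations": 0},
]

# Headline numbers a write-up might state for the toy rows above.
TOY_HEADLINES = {
    "runs": 3,
    "claims": 150,
    "hallucinations": 5,
    "published_hallucinations": 0,
}


def recompute(rows):
    """Rebuild each headline total from the raw rows alone."""
    totals = {"runs": len(rows)}
    for field in ("claims", "hallucinations", "published_hallucinations"):
        totals[field] = sum(row[field] for row in rows)
    return totals


def check(rows, headlines):
    """Compare each stated headline with its recomputed value."""
    actual = recompute(rows)
    results = []
    for name, stated in headlines.items():
        got = actual.get(name)
        status = "PASS" if got == stated else "FAIL"
        results.append((status, name, stated, got))
    return results


for status, name, stated, got in check(TOY_ROWS, TOY_HEADLINES):
    print(f"{status}  {name}: stated {stated}, recomputed {got}")
```

Because the check reads only the raw rows and the stated numbers, anyone with the files can rerun it offline and get the same answer.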
The release
- WHITEPAPER.pdf: the methods paper, 18 pages.
- LEDGER.json / LEDGER.csv: one row per company and model combination, with claims, hallucinations, gate result, and cost.
- defects-all-cells.json: the deterministic detector's verdict for every cell.
- verify_results.py + DATA-DICTIONARY.md: the verification script and the column-level documentation.
- PROCESS-ARCH.png: the stage-by-stage process map.
- VENDOR-QUESTIONS.md: the retrieval test contract for outside systems.
- results-headline-configs.csv, results-per-combination.csv, rates-per-combination.md, curated-examples.md: rate tables and hand-reviewed examples.
- LICENSE: CC BY 4.0.
- README.md: the replication quick-start.
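Since the ledger holds one row per company and model combination, a per-model rate is a group-by over those rows. The sketch below shows one way to do that with the standard library. The column names (`company`, `model`, `claims`, `hallucinations`) and the toy CSV are assumptions for illustration only; check DATA-DICTIONARY.md for the real headers before pointing this at LEDGER.csv.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for LEDGER.csv. Column names and values are invented for
# illustration; DATA-DICTIONARY.md defines the real ones.
TOY_CSV = """company,model,claims,hallucinations
company-a,model-x,40,2
company-b,model-x,60,3
company-a,model-y,50,0
"""


def rates_by_model(csv_file):
    """Return {model: hallucinations / claims}, pooled over all companies."""
    claims = defaultdict(int)
    hallucinations = defaultdict(int)
    for row in csv.DictReader(csv_file):
        claims[row["model"]] += int(row["claims"])
        hallucinations[row["model"]] += int(row["hallucinations"])
    # Skip models with no claims to avoid dividing by zero.
    return {m: hallucinations[m] / claims[m] for m in claims if claims[m] > 0}


rates = rates_by_model(io.StringIO(TOY_CSV))
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.1%}")
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open("LEDGER.csv", newline="")`, after adjusting the column names.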
If a number does not reproduce, tell us: hello@yardstickresearch.app.