# Grounded-pipeline audit: public data release

Reproducibility data for the methods paper `WHITEPAPER.pdf` (v1.1, 2026-06-11). Vendors are the 30 hardest (bottom-tercile, thin-evidence) and are anonymized as `cohort-g<gold>`; model names are kept because they are the comparison.

## Replicate the paper in three steps
1. Read `DATA-DICTIONARY.md` (about five minutes): every file and every column, defined.
2. Run `python verify_results.py` in this folder (Python 3.9+, standard library only, no network, no model calls). It recomputes every headline number in the paper from `LEDGER.json` alone (rates, costs, containment, confidence intervals, the Fisher test, the Clopper-Pearson bound, the power statement) and prints PASS or FAIL against each value the paper states.
3. Audit any individual verdict in `defects-all-cells.json` (per-claim detector output per cell) against its ledger row.

`PROCESS-ARCH.png` is the stage-by-stage process map of the pipeline under test: who retrieves and freezes the evidence, which model runs each step, what the deterministic detector checks, and what the fail-closed gate plus fix loop do before anything publishes.

## Containment, in one line
Under the recommended configuration every hallucination was caught before publication: 3 of 3 blocked by the fail-closed gate, 0 would have reached a customer (per-row `gate_blocked` and `postgate_escapes` in the ledger). The all-flagship control, judged behind the same gate, would have published 1 of its 6.

## Test your own system
`VENDOR-QUESTIONS.md` is the retrieval and grounding test contract: the factual questions each system must answer per vendor, and the output contract (a one- or two-sentence summary, a verbatim supporting quote, the live source URL, and a provenance tag, for every answer). Grading is deterministic: returned quotes are string-matched at their cited URLs and against a fact-checked answer key of validated quotes Yardstick holds for these vendors. Yardstick's scoring methodology is proprietary and deliberately excluded; this dataset tests retrieval accuracy and hallucination containment, not scoring.

## License
Creative Commons Attribution 4.0 International (CC BY 4.0): share, republish, and adapt freely, with attribution to Yardstick Research. Full text in `LICENSE`.

## Files
- `WHITEPAPER.pdf`: the methods paper (v1.1, 2026-06-11).
- `DATA-DICTIONARY.md`: every file and every column, defined. Read first.
- `verify_results.py`: recomputes the paper's headline numbers from `LEDGER.json`; prints PASS/FAIL.
- `PROCESS-ARCH.png`: the stage-by-stage process map (Figure 5 in the paper).
- `VENDOR-QUESTIONS.md`: the retrieval question set + exact-quote output contract described above.
- `LEDGER.json` / `LEDGER.csv`: one row per vendor x model-combination (claims, hallucinations, mistags, gate result, cost).
- `rates-per-combination.md`: hallucination rate for each of the model combinations.
- `defects-all-cells.json`: the deterministic detector's verdict for every cell (the 454 ledger cells plus the 3 opus fix-loop redos): per-claim `section` / `tags` / `why`, H1-H7 counts, gate result, scores. **Free text (claim text, source receipts, URLs, domains) is removed** so nothing can de-anonymize a vendor.
- `curated-examples.md`: the winner's three hallucinations shown in full, hand-redacted.
- `results-per-combination.csv` / `results-headline-configs.csv`: flat tables of the same rates and costs.

## What was removed
Vendor names, domains, source URLs, and verbatim source receipts, because they identify the vendor (directly or via acquirers, founders, and deal figures). The detector is deterministic, so the retained verdict fields are sufficient to reproduce every rate.
