# Data dictionary

Every file and every column in this release, defined. Read this first; it pairs with
`verify_results.py`, which recomputes the paper's headline numbers from `LEDGER.json`.

Conventions used throughout:

- **Vendors are anonymized** as `cohort-g<gold>`: the industry cohort plus the vendor's
  human-validated gold score (for example `fintech-g32`). Where two vendors in one cohort
  share a gold score, a suffix letter disambiguates (`-a`, `-b`).
- **Model names are real**; they are the comparison being published.
- Inside any retained text field, removed content is replaced in place: `[URL]` for a URL
  or bare domain, `[VENDOR]` for the vendor's name, `(sample removed)` for quoted samples.

## 1. LEDGER.json / LEDGER.csv

The results ledger: **one row per (vendor x model-combination) run**, 454 rows. This is
the file every rate, cost, and containment number in the paper is computed from. The CSV
is a 12-column subset of the JSON; the JSON additionally carries `arm`, `model_01`,
`model_02`, `model_03`, `receipts`, and `exact_cost`.

| column | meaning |
|---|---|
| `vendor_anon` | anonymized vendor id, `cohort-g<gold>` |
| `cohort` | industry cohort the vendor belongs to |
| `arm` | `current` = the all-Claude control; `new` = a candidate configuration (JSON only) |
| `config` | the (step-01 / step-02 / step-03) model combination, short names |
| `model_01` / `model_02` / `model_03` | full model ids per step: 01 cohort-fit screen, 02 synthesis (the hallucination-bearing step), 03 validation (JSON only) |
| `claims` | provenance-tagged claims in the synthesized dossier that the detector graded; the rate denominator |
| `receipts` | claims that carry a verbatim supporting quote (JSON only) |
| `hallucinations` | ungrounded claims: no citation, the cited source is absent from the frozen bundle, or the quoted receipt is not verbatim in the bundle. The gate metric |
| `mistags` | claims whose content is grounded but whose provenance label does not match the source tier. Auto-fixable label errors, tracked separately, not the gate metric |
| `raw_escapes` | hallucinations + mistags, before the gate is applied |
| `postgate_escapes` | defects that would still publish after the gate verdict (always 0 when `gate_blocked` is true) |
| `gate_blocked` | whether the fail-closed gate (`validate_dossier`) blocked this dossier from publishing |
| `score_delta` | the configuration's computed vendor score minus the human-validated gold score |
| `gold` | the human-validated reference score for the vendor (also embedded in `vendor_anon`) |
| `cost_usd` | per-vendor cost of the full three-step run at the pinned 2026-06-07 price snapshot |
| `exact_cost` | `true`: summed from provider-billed token usage. `false`: computed from recorded token counts at the pinned list prices (the Claude control runs in-harness on a subscription, so its retail-at-scale cost is computed, not billed) (JSON only) |

Derived quantities the paper uses:

- hallucination rate = `sum(hallucinations) / sum(claims)` over a configuration's rows
- mistag rate = `sum(mistags) / sum(claims)`
- re-run rate = share of a configuration's vendors with `hallucinations > 0`
- effective cost per vendor = `mean(cost_usd) + re-run-rate x 0.49` (the modeled cost of
  redoing a hallucinated dossier on the flagship; section 5.6 of the paper measures the
  real redo at about $0.265, which only lowers the winner's effective cost)
- containment = `gate_blocked` and `postgate_escapes` per row; "would have published" =
  hallucinations on rows where `gate_blocked` is false

## 2. defects-all-cells.json

The deterministic detector's full verdict for **457 cells**: the 454 ledger cells plus
the three `REDO-opus-fix` cells (the measured fix loop, paper section 5.6). Free text is
stripped so nothing can de-anonymize a vendor.

Cell-level fields:

| field | meaning |
|---|---|
| `vendor_anon`, `cohort_id`, `gold_headline_score` | vendor identity, as in the ledger |
| `config_id` | `current` for the control; `01-<model>__02-<model>__03-<model>` for a candidate configuration; `REDO-opus-fix` for a fix-loop redo |
| `bundle_sources_frozen` | number of sources in the vendor's frozen evidence bundle |
| `arms` | one entry per arm present in this cell, fields below |

Arm-level fields (inside `arms`):

| field | meaning |
|---|---|
| `vd_gate_verdict` / `vd_gate_fails` | the fail-closed gate's verdict and which checks failed |
| `gate_blocks_publish` | whether the gate blocks this dossier (the ledger's `gate_blocked`) |
| `claims_total` / `grounded_ok` | graded claims and how many grounded cleanly |
| `H_counts` | defects by category H1 to H7 (taxonomy below) |
| `escape_count` | hallucinations + mistags (the ledger's `raw_escapes`) |
| `scores` / `score_agreement` | the dossier's computed score, and gold / computed / delta |
| `validate_dossier` | per-check status tokens (`pass` / `fail` / `skipped`). Detail strings are deliberately reduced to the status token, and check names are normalized for release: the details and internal names can embed the proprietary scoring methodology |
| `ungrounded` / `mistagged` | one entry per defective claim: `section` (dossier section; the scored-assessment sections are collapsed to the single label `Scored fields` for release), `tags` (provenance tags as written), `why` (the detector's reason), `urls_n` (how many URLs the claim cited; the URLs themselves are removed) |
| `banned_urls_n` | claims citing a banned aggregator source |
| `unresolved_n` | claims the detector could not resolve, routed to human spot-check |

## 3. Defect taxonomy (H1 to H7)

| code | meaning |
|---|---|
| H1 | fabricated quote: the receipt does not appear in the cited source |
| H2 | dead URL: the cited URL is not in the frozen bundle |
| H3 | misattribution: grounded content, wrong provenance label (the mistag class) |
| H4 | unsupported claim: a sourced tag with no citation behind it |
| H5 | fabricated entity fact: an invented concrete fact about the company |
| H6 | phantom integration: an integration the evidence never names |
| H7 | estimate-as-fact: an estimate stated as a sourced fact |

Hallucination = H1 + H2 + H4 + H5 + H6 + H7. Mistag = H3.

## 4. Provenance tags

`VENDOR-CLAIMED` (the vendor's own site or press release), `THIRD-PARTY` (independent:
news, analyst, regulator, review), `MEASURED` (directly observed or tested),
`ESTIMATED` (the system's own labeled inference), `UNKNOWN` (no public source found;
a correct answer, not a defect).

## 5. Flat result tables

- `results-headline-configs.csv`: the four full-pool configurations. Columns:
  `config_01_02_03`, `vendors`, `claims`, `hallucinations`, `rate_pct`, `ci95_low_pct` /
  `ci95_high_pct` (95% vendor-level cluster-bootstrap interval), `fisher_vs_control`.
- `results-per-combination.csv`: all 31 model combinations. Columns: `step_01`,
  `step_02`, `step_03`, `vendors`, `claims`, `hallucinations`, `hallucination_rate_pct`.
- `rates-per-combination.md`: the same 31 combinations as a readable table.

## 6. What is not in this release

Vendor names, domains, source URLs, verbatim receipts (they identify vendors);
Yardstick's internal scoring methodology (proprietary); the frozen evidence
bundles themselves (they are the vendor's source pages and would de-anonymize). The
15-vendor panel rates quoted in the paper's reliability table are historical values
recorded before the confirmation extension and are reported, not recomputable; every
full-pool value is recomputable. The detector is deterministic, so the retained verdict
fields are sufficient to reproduce every rate in the paper: run `verify_results.py`.
