# Curated hallucination examples (hand-reviewed)

The winning configuration (`gpt-5-mini / grok-4.3 / mistral-small`) produced **3
ungrounded claims across 804 claims on 30 vendors (0.37%)**. All three are shown
below in full, hand-redacted: vendor names, domains, and URLs are removed; the
substance of each fabricated/ungrounded claim is preserved. Each was caught by
the deterministic detector (no LLM judge), then re-synthesized on the more
expensive flagship model (Opus 4.8) to test the fix loop.

The redactions here were done by hand; the bulk `defects-all-cells.json` removes
free text entirely.

---

## 1 · `healthcare-clinical-g38` — ungrounded self-assessed score

| field | value |
|---|---|
| step-02 model | `x-ai/grok-4.3` |
| section | Identity |
| provenance tags | `[VENDOR-CLAIMED] [ESTIMATED]` |
| detector verdict | **ungrounded** — *"sourced tag with no URL"* |
| claim (verbatim) | `- **ai_native_score**: 50.` |

The model emitted a numeric `ai_native_score` of 50, tagged it as vendor-claimed
and estimated, but attached **no source URL**. The deterministic check requires a
sourced tag to carry a citation present in the frozen bundle; with none, the
number is unsupported.

**Opus re-sample:** cleared — the flagship re-synthesis produced 0 ungrounded claims.

---

## 2 · `healthcare-rcm-g31` — fabricated verification step

| field | value |
|---|---|
| step-02 model | `x-ai/grok-4.3` |
| section | Pricing detail |
| provenance tags | `[UNTAGGED]` |
| detector verdict | **ungrounded** — *"no recognized provenance tag"* |
| claim (redacted) | `Wayback regression check: [vendor-domain]/pricing resolves only to landing page with no archived snapshots of published tiers.` |

The model asserted it had run a "Wayback regression check" against the Internet
Archive — a plausible-sounding verification step that is **not in the frozen
evidence bundle and carries no provenance tag**. The detector flags it as
ungrounded regardless of whether the underlying statement happens to be true.

**Opus re-sample:** **not cleared** — the flagship *also* produced an ungrounded
pricing claim here (`[vendor] does not publish self-serve price tiers…`). This is
the 1-of-3 the fix loop could not resolve: a vendor where even the flagship can't
stay grounded on thin evidence, so correctness depends on the gate plus human
escalation, not the model.

---

## 3 · `tax-g59` — fabricated verification step

| field | value |
|---|---|
| step-02 model | `x-ai/grok-4.3` |
| section | Pricing detail |
| provenance tags | `[UNTAGGED]` |
| detector verdict | **ungrounded** — *"no recognized provenance tag"* |
| claim (redacted) | `Wayback regression check: [date] capture of package-options page listed main tiers as "Contact for Pricing".` |

Same pattern as #2: an invented "Wayback regression check" asserting a specific
archived capture, untagged and absent from the frozen bundle.

**Opus re-sample:** cleared — 0 ungrounded claims on flagship re-synthesis.

---

**Fix-loop summary:** one flagship redo cleared **2 of 3** (`healthcare-clinical-g38`,
`tax-g59`); the third (`healthcare-rcm-g31`) remained because the flagship
hallucinates there too. The validation gate, not the choice of model, is what
guarantees a clean published dossier.
