The verification gate
Step 4 is the heart of study-swarm. Before any finding informs the design, a verifier of a different model family from the synthesizer — with the synthesizer’s reasoning hidden — checks every citation. The Step 2 research agents are inputs (they produce citations); they are not verifiers of the synthesis. A separate family must check, or it’s a model grading its own homework — the exact failure the protocol prescribes verifiers to prevent.
Non-circular by construction. The verifier adjudicates via a deterministic retrieval oracle (existence) plus a different-family lens (groundedness). It does not re-run this protocol and does not rely on anyone’s recall.
The two-stage check, per citation
- Existence / attribution — a retrieval oracle, not a parametric LLM. Resolve the arXiv ID / DOI / URL and confirm the paper exists with the stated title, authors, and year. This stage must retrieve (fetch the source / arXiv / Crossref) and never rely on model memory.
- Groundedness — finding matches source. Confirm the one-sentence finding describes what the source actually claims. Existence is not enough: a real paper can still be cited for something it never said.
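The two stages can be sketched as a single function with the oracle and the lens injected. This is a minimal illustration, not the study-swarm implementation: `Citation`, `SourceRecord`, `CheckResult`, and `check_citation` are hypothetical names, the oracle is assumed to return `None` when an identifier resolves to nothing, and skipping stage 2 when stage 1 fails is an assumption about ordering.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Citation:
    identifier: str              # arXiv ID, DOI, or URL
    title: str
    authors: Tuple[str, ...]
    year: int
    finding: str                 # the one-sentence finding attributed to the source


@dataclass(frozen=True)
class SourceRecord:
    title: str
    authors: Tuple[str, ...]
    year: int
    text: str                    # retrieved abstract or full text


@dataclass(frozen=True)
class CheckResult:
    exists: bool
    grounded: Optional[bool]     # None when stage 2 did not run


# Hypothetical signatures: the oracle fetches by identifier; the lens is a
# different-family model call that compares a finding against retrieved text.
Oracle = Callable[[str], Optional[SourceRecord]]
Lens = Callable[[str, str], bool]


def _metadata_matches(c: Citation, r: SourceRecord) -> bool:
    same_title = c.title.strip().casefold() == r.title.strip().casefold()
    same_authors = [a.casefold() for a in c.authors] == [a.casefold() for a in r.authors]
    return same_title and same_authors and c.year == r.year


def check_citation(c: Citation, oracle: Oracle, lens: Lens) -> CheckResult:
    # Stage 1: existence / attribution, answered by retrieval only.
    # Oracle exceptions (e.g. network failure) propagate on purpose:
    # unavailability must not be silently read as a pass or a failure.
    record = oracle(c.identifier)
    if record is None or not _metadata_matches(c, record):
        return CheckResult(exists=False, grounded=None)
    # Stage 2: groundedness, judged against the retrieved text.
    return CheckResult(exists=True, grounded=lens(c.finding, record.text))
```

Passing the oracle and lens as parameters keeps the function testable with stubs and makes it hard to answer stage 1 from model memory by accident.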
Why a different family, reasoning-stripped
- LLMs can’t reliably self-verify — the external verifier carries the gains; the self-critique content is largely inert (Huang 2023, Kambhampati 2024, Stechly 2024).
- Same-family judges self-prefer — self-recognition correlates linearly with self-preference, so partial blinding doesn’t help (Panickssery 2024); a disjoint-family panel is less biased at lower cost (Verga 2024).
- Hide the reasoning — manipulated chain-of-thought alone can inflate a judge’s false-positives dramatically (Khalifa 2026), and CoT is often post-hoc rationalization (Turpin 2023). The verifier sees the bare citation claim, never the “why.”
See Research grounding for the full citations.
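As a rough sketch of the two constraints above (different family, reasoning hidden), the verifier's input can be built by copying only the citation and the finding and refusing a same-family verifier. The names `SynthesizedFinding` and `verifier_payload`, and the use of plain strings for model families, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SynthesizedFinding:
    citation: str     # e.g. an arXiv ID, DOI, or URL
    finding: str      # the one-sentence claim
    reasoning: str    # the synthesizer's "why"; must never reach the verifier


def verifier_payload(
    item: SynthesizedFinding,
    synthesizer_family: str,
    verifier_family: str,
) -> Dict[str, str]:
    """Build the verifier's input: the bare citation claim and nothing else."""
    if verifier_family == synthesizer_family:
        raise ValueError("verifier must come from a different model family")
    # Copy fields explicitly (an allow-list) so that reasoning, or any field
    # added later, cannot leak into the payload by default.
    return {"citation": item.citation, "finding": item.finding}
```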
Ensemble: ≥3 decorrelated lenses
Count the retrieval oracle as one mechanism-diverse lens, then add ≥2 different-family LLM lenses. Diversity of lenses — not raw count — is the load-bearing variable, because LLM errors are correlated (Rajan 2025, Kim 2025).
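One way to enforce this composition is a pre-flight check on the panel before any citation is verified. The sketch below is illustrative only: `Lens` and `panel_problems` are hypothetical names, and requiring the two LLM lenses to come from families distinct from each other (not only from the synthesizer) is an assumed reading of "decorrelated".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Lens:
    name: str
    kind: str      # "oracle" or "llm"
    family: str    # model family for LLM lenses; empty for the oracle


def panel_problems(panel: List[Lens], synthesizer_family: str) -> List[str]:
    """Return the reasons a panel fails the >=3 decorrelated lenses rule."""
    problems = []
    llm_lenses = [lens for lens in panel if lens.kind == "llm"]

    if not any(lens.kind == "oracle" for lens in panel):
        problems.append("no retrieval oracle lens")

    if any(lens.family == synthesizer_family for lens in llm_lenses):
        problems.append("an LLM lens shares the synthesizer's family")

    # Count families, not lenses: two lenses from one family are assumed to
    # make correlated errors and therefore count once.
    families = {lens.family for lens in llm_lenses} - {synthesizer_family}
    if len(families) < 2:
        problems.append("fewer than 2 distinct non-synthesizer LLM families")

    return problems
```

An empty list means the panel is acceptable under these assumptions; counting distinct families rather than lenses reflects the point that diversity, not raw count, is what matters.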
The halt table
Scope is per-finding — other verified findings proceed.
| Verdict / condition | Action |
|---|---|
| Fabricated | The finding is dropped — no real source to correct, so no re-verification. |
| Misattributed | Correct the attribution and re-verify once; a second non-clean verdict drops it. |
| Cannot confirm | Removed from the design AND surfaced to a human with a contrastive frame — “you probably expected finding N citable; I left it out because the oracle couldn’t confirm it — override with X.” Reinstated only if a human confirms the source. |
| Verifier / oracle unavailable | Halt and escalate. Unavailability is never read as “citations are fine,” and never as fabrication. Proceeding without a completed verification is forbidden. |
| No different family reachable | The retrieval oracle still runs and gates existence; the groundedness LLM lens halts and escalates rather than running same-family. |
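The table can be read as a small dispatch function applied to each finding independently. The sketch below is one possible encoding, not the protocol's own code: the `Verdict` and `Action` enums and `next_action` are hypothetical, `CLEAN` is inferred from the table's mention of a "non-clean verdict", and treating unavailability as a halt even during re-verification is an assumption consistent with the rule that proceeding without a completed verification is forbidden.

```python
from enum import Enum, auto


class Verdict(Enum):
    CLEAN = auto()
    FABRICATED = auto()
    MISATTRIBUTED = auto()
    CANNOT_CONFIRM = auto()
    UNAVAILABLE = auto()            # verifier or oracle could not be reached
    NO_DIFFERENT_FAMILY = auto()    # groundedness lens has no other family to use


class Action(Enum):
    PROCEED = auto()
    DROP = auto()
    CORRECT_AND_REVERIFY = auto()
    REMOVE_AND_SURFACE = auto()     # remove from the design, surface to a human
    HALT_AND_ESCALATE = auto()


def next_action(verdict: Verdict, already_reverified: bool = False) -> Action:
    """Map one finding's verdict to its action; call once per finding."""
    # Conditions that mean verification did not complete always halt. For
    # NO_DIFFERENT_FAMILY the oracle still gates existence; only the
    # groundedness lens halts, and it never falls back to a same-family model.
    if verdict in (Verdict.UNAVAILABLE, Verdict.NO_DIFFERENT_FAMILY):
        return Action.HALT_AND_ESCALATE
    if verdict is Verdict.CLEAN:
        return Action.PROCEED
    # A second non-clean verdict after a corrected attribution drops the finding.
    if already_reverified:
        return Action.DROP
    return {
        Verdict.FABRICATED: Action.DROP,
        Verdict.MISATTRIBUTED: Action.CORRECT_AND_REVERIFY,
        Verdict.CANNOT_CONFIRM: Action.REMOVE_AND_SURFACE,
    }[verdict]
```

Because the function takes a single finding's verdict, a `DROP` or `REMOVE_AND_SURFACE` for one finding leaves the other verified findings free to proceed.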