The five steps
The protocol is a fixed five-step shape. Steps 1–3 produce a research-grounded design dispatch; Step 4 verifies it; Step 5 wires the evidence to the architecture.
The one-line guard: no finding reaches Step 5 unverified. If you cannot verify — verifier down, no different family reachable, retrieval oracle unreachable — you halt and escalate; you do not proceed.
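The guard can be pictured as a filter that refuses to run on incomplete information. The following is a minimal sketch, not the protocol's actual implementation: `Finding`, `Verdict`, `VerificationUnavailable`, and `gate` are hypothetical names introduced here, and it assumes the verifier's results arrive as a mapping from finding number to verdict.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    VERIFIED = "verified"
    REJECTED = "rejected"
    UNAVAILABLE = "unavailable"  # verifier down, no different family, oracle unreachable


class VerificationUnavailable(Exception):
    """Signals that the protocol must halt and escalate."""


@dataclass(frozen=True)
class Finding:
    number: int
    claim: str
    identifier: str  # arXiv ID or DOI


def gate(findings, verdicts):
    """Return only verified findings; halt if any finding could not be checked.

    `verdicts` maps finding number -> Verdict. A missing entry is treated as
    UNAVAILABLE (an assumption of this sketch), so an unchecked finding can
    never slip through to Step 5.
    """
    unchecked = [
        f.number
        for f in findings
        if verdicts.get(f.number, Verdict.UNAVAILABLE) is Verdict.UNAVAILABLE
    ]
    if unchecked:
        raise VerificationUnavailable(f"cannot verify findings {unchecked}; escalate")
    return [f for f in findings if verdicts[f.number] is Verdict.VERIFIED]
```

Raising an exception rather than returning a partial list is a deliberate choice in this sketch: it makes "halt and escalate" the default path, so proceeding without verification requires someone to catch the error on purpose.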
Step 1 — Identify load-bearing decisions
List the specific questions where empirical evidence would change the answer.
Aim for 3–5. Fewer is fine when the decision is genuinely substantial: run with 1–2 agents. The decision to investigate governs whether you invoke the protocol; the number of evidence-changing questions governs how broad the pass is. Don’t manufacture questions to hit a count, and don’t abort for being under three. With more than about six, split the questions into multiple passes.
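The sizing rule can be sketched as a small helper. This is an illustration under stated assumptions: `plan_passes` is a hypothetical name, the cap of six stands in for the approximate "~6" threshold, and splitting into evenly sized passes is a choice made here, not something the protocol prescribes.

```python
import math


def plan_passes(questions, max_per_pass=6):
    """Group evidence-changing questions into research passes.

    One or two questions are still a valid pass; nothing is padded to reach
    a target count. Above `max_per_pass`, questions are spread across the
    smallest number of passes that keeps each pass at or under the cap.
    """
    if not questions:
        return []
    n_passes = math.ceil(len(questions) / max_per_pass)
    size = math.ceil(len(questions) / n_passes)
    return [questions[i:i + size] for i in range(0, len(questions), size)]
```

For example, seven questions come out as one pass of four and one pass of three, rather than six plus a leftover single question.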
A question is load-bearing if:
- you can picture two different designs that hinge on the answer;
- the honest current answer is “I think…”, not “evidence says…”;
- an adjacent field (HCI, SRE, compilers, databases) has probably already measured it;
- getting it wrong ships a known-broken default.
Worked decomposition: “Should a retry reuse the previous output?” splits into “does context carryover cause sycophancy drift?” and “do explanations increase over-reliance on wrong answers?” — two evidence-changing questions, not one opinion.
Step 2 — Dispatch parallel research agents
One agent per question, dispatched in parallel (a single batch). Each agent’s prompt must demand:
- the context — what’s being built, why the question matters;
- the question scoped to evidence, not opinion;
- specific findings: paper titles, authors, years, URLs, and a one-sentence key finding per source;
- a word cap (typically 500–600);
- “specificity over breadth — 6–8 well-sourced findings beat 20 vague gestures.”
Existence is established in Step 4 by resolving the identifier, not by recall — so a paper an agent “remembers” but cannot retrieve does not enter the dispatch.
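A single-batch dispatch might look like the sketch below. It assumes agents are reachable through some async callable; `run_agent`, `build_prompt`, `dispatch`, and `fake_agent` are hypothetical names, and the prompt wording is only one way to encode the five demands listed above.

```python
import asyncio

PROMPT_TEMPLATE = """\
Context: {context}
Question (evidence, not opinion): {question}
For each source give: paper title, authors, year, URL, and a one-sentence key finding.
Word cap: {word_cap} words.
Specificity over breadth: 6-8 well-sourced findings beat 20 vague gestures.
"""


def build_prompt(context, question, word_cap=600):
    """Fill the template with the build context and one scoped question."""
    return PROMPT_TEMPLATE.format(context=context, question=question, word_cap=word_cap)


async def dispatch(context, questions, run_agent):
    """Send one prompt per question and await all of them as one batch.

    `run_agent` is any async callable taking a prompt string. Results come
    back in the same order as `questions`.
    """
    prompts = [build_prompt(context, q) for q in questions]
    return await asyncio.gather(*(run_agent(p) for p in prompts))


async def fake_agent(prompt):
    # Stand-in for a real research-agent call; it just echoes the question line.
    await asyncio.sleep(0)
    return prompt.splitlines()[1]


if __name__ == "__main__":
    questions = [
        "does context carryover cause sycophancy drift?",
        "do explanations increase over-reliance on wrong answers?",
    ]
    for reply in asyncio.run(dispatch("retry design", questions, fake_agent)):
        print(reply)
```

`asyncio.gather` starts every agent call before waiting on any of them, which is what makes this one batch rather than a sequential loop.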
Step 3 — Synthesize into a “Research grounding” section
Put a dedicated section near the top of the design doc, before the decisions. One template per finding:
N. **<one-sentence finding>.** <Authors> <year> (<arXiv:NNNN.NNNNN or DOI>). <design implication>.
For example:
- Contrastive explanations with a predicted human foil improve independent decision-making. Buçinca et al. 2024 (arXiv:2410.04253). Implication: every recommendation carries a “you might think X; I’m recommending Y because…” frame.
The template does three things at once: states the finding, cites a verifiable source, and names the design implication — so the link from evidence to choice is visible.
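If findings are kept as structured records, the template can be rendered mechanically. The sketch below is one possible shape: `ResearchFinding` and its field names are hypothetical, and the "Implication:" label follows the worked example above rather than the bare template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchFinding:
    number: int
    finding: str      # one-sentence finding
    authors: str      # e.g. "Buçinca et al."
    year: int
    identifier: str   # arXiv ID or DOI, resolved later in Step 4
    implication: str  # what the design does because of this finding

    def render(self):
        """Format the record as one numbered line of the Research grounding section."""
        return (
            f"{self.number}. **{self.finding.rstrip('.')}.** "
            f"{self.authors} {self.year} ({self.identifier}). "
            f"Implication: {self.implication.rstrip('.')}."
        )


if __name__ == "__main__":
    example = ResearchFinding(
        number=1,
        finding="Contrastive explanations with a predicted human foil "
                "improve independent decision-making",
        authors="Buçinca et al.",
        year=2024,
        identifier="arXiv:2410.04253",
        implication="every recommendation carries a contrastive frame",
    )
    print(example.render())
```

Keeping the identifier as its own field means the verifier in Step 4 can read it directly instead of parsing it back out of prose.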
Step 4 — External verification gate
A verifier of a different model family, reasoning-stripped, checks every citation before it informs the design. This step is large enough to have its own page.
Step 5 — Connect findings to architecture
The decision section references findings by number where they justify a choice; each load-bearing choice traces to at least one finding. Citations without a connection are noise.
“Retry uses a fresh prompt without the previous output. (sycophancy mitigation, Kim 2025.)”
The choice is annotated with the source and the reason, so a reader knows why the rule exists — not just that it does.
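Both directions of that link can be checked mechanically. The following is a rough sketch, assuming decisions are recorded as text mapped to the finding numbers they cite; `check_traceability` and its return shape are hypothetical, not part of the protocol.

```python
def check_traceability(finding_numbers, decisions):
    """Check that choices and findings are wired to each other.

    `finding_numbers`: numbers used in the Research grounding section.
    `decisions`: mapping of decision text -> list of cited finding numbers.

    Returns a tuple (untraced, unused, dangling):
      untraced - decisions that cite no known finding
      unused   - findings that no decision cites (noise)
      dangling - decision text -> cited numbers that match no finding
    """
    known = set(finding_numbers)
    cited = set()
    untraced = []
    dangling = {}
    for text, refs in decisions.items():
        refs = set(refs)
        if not refs & known:
            untraced.append(text)
        if refs - known:
            dangling[text] = sorted(refs - known)
        cited |= refs & known
    unused = sorted(known - cited)
    return untraced, unused, dangling


if __name__ == "__main__":
    untraced, unused, dangling = check_traceability(
        [1, 2],
        {"Retry uses a fresh prompt without the previous output.": [1]},
    )
    print(untraced, unused, dangling)
```

An empty `untraced` list means every recorded choice traces to at least one finding; a non-empty `unused` list points at citations that never connect to a decision.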