Recovery — The Three R's
The swarm control plane ships three recovery verbs. They are siblings, not synonyms — each one names a different shape of recovery, and each one writes its own prefix into the audit trail so a future inspector can grep wave_state_events / agent_state_events for the verb that wrote each row.
| Verb | What it does | Where the row lands |
|---|---|---|
| `swarm revalidate` | Repairs blocked agent_runs in place (BLOCKED → complete); flips the wave back to `collected` when every latest agent_run reaches `complete` | `agent_state_events` (+ wave row when the wave flips) with `revalidate:` reason prefix |
| `swarm rewind` | Restores the working tree to a save-point tag AND lawfully aborts orphaned in-flight waves + agent_runs (status → `aborted_for_rewind`) | `wave_state_events` + `agent_state_events` with `rewind:` reason prefix |
| `swarm redrive` | Resumes an in-flight wave at the same wave_id; completed receipts survive byte-identical, only eligible failed/unstarted agent_runs are made re-dispatchable (status → `dispatched`) | `wave_state_events` + `agent_state_events` with `redrive:` reason prefix |
The product thesis: Rewind erases, Redrive resumes. They are not the same verb. A rewind is what you do when the slice itself was a wrong turn and you want the working tree back at the save-point; the in-flight rows survive as forensic evidence with status aborted_for_rewind. A redrive is what you do when the slice was right but a subset of agents failed mid-flight; completed receipts are immutable and only the failure tail gets re-dispatched. Revalidate is the third sibling — narrowly, “the agent’s output JSON was corrected on disk; ratify it through the override path so the audit trail names who said it was good and why.”
The shared discipline
Every verb is built on the same four contracts:
- Dry-run by default; `--apply` required to mutate. The dry-run renders the full plan (what would change, what would be preserved, what would be refused) so the operator previews the effect before any state leaves disk. Same posture as `pg_resetwal -n` or `kubectl --dry-run=server`.
- `--reason "<text>"` is non-optional. Every verb refuses without a non-empty reason string. The text is recorded verbatim in the matching `_state_events` row, prefixed with the verb name (`revalidate:`, `rewind:`, `redrive:`) so the audit trail is greppable by intent.
- Zero raw SQL. All status mutations go through the canonical `transitionAgent` / `transitionWave` helpers in `packages/dogfood-swarm/lib/state-machine.js` and `packages/dogfood-swarm/lib/wave-state-machine.js`. Raw `UPDATE waves SET status = …` would skip the audit row and corrupt the chain.
- Single transaction. Each verb wraps its DB writes in one `better-sqlite3` transaction so a partial write cannot leave the control plane in a torn state (e.g. agents `complete` but wave still `failed`). Rewind also runs `git reset --hard` BEFORE the DB transaction so a partial DB-side failure surfaces loudly as “tree at target, DB tx failed — inspect manually.”
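The four contracts can be sketched as one small wrapper. This is a minimal illustration, not the repo's code: `prefixedReason`, `runVerb`, the `plan` shape, and the in-memory `Map` standing in for the database are all assumptions made for the example. A real verb would go through `transitionAgent` / `transitionWave` inside a `better-sqlite3` transaction instead.

```javascript
// Illustrative sketch only; function names and data shapes are assumptions.
const VERBS = new Set(['revalidate', 'rewind', 'redrive']);

// Contract 2: a non-empty reason, recorded with the verb name as prefix.
function prefixedReason(verb, reason) {
  if (!VERBS.has(verb)) throw new Error(`unknown verb: ${verb}`);
  if (typeof reason !== 'string' || reason.trim() === '') {
    throw new Error(`${verb}: --reason "<text>" is required and must be non-empty`);
  }
  return `${verb}: ${reason}`;
}

// `plan` is a list of { id, from, to } steps; `state` maps row id -> status.
function runVerb({ verb, reason, apply = false, plan, state, events }) {
  const fullReason = prefixedReason(verb, reason);

  // Contract 1: dry-run by default. Return the plan, mutate nothing.
  if (!apply) return { applied: false, plan };

  // Contract 4 (approximated): stage every write first and commit only if all
  // steps are lawful, so a mid-plan failure leaves state and events untouched.
  const staged = new Map(state);
  const newEvents = [];
  for (const step of plan) {
    if (staged.get(step.id) !== step.from) {
      throw new Error(`${verb}: row ${step.id} is not in status ${step.from}`);
    }
    staged.set(step.id, step.to);
    // Contract 3 (approximated): every status change is paired with an audit event.
    newEvents.push({ id: step.id, from: step.from, to: step.to, reason: fullReason });
  }
  for (const [id, status] of staged) state.set(id, status);
  events.push(...newEvents);
  return { applied: true, plan };
}
```

Calling `runVerb` without `apply: true` leaves `state` and `events` unchanged, which is the property the dry-run default is meant to guarantee.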
swarm revalidate
Repair the latest agent_run for one or more domains when its output is now correct on disk but the control plane has it stuck in `invalid_output` or `ownership_violation`.

Usage: `swarm revalidate <run-id> --reason "<text>" --domain=name:path [--domain=name:path ...] [--apply]`

Behavior (per `packages/dogfood-swarm/commands/revalidate.js`):
- `--reason "<text>"` and at least one `--domain=name:path` are required; the verb throws if either is missing.
- Each `--domain` value names a domain on the run and a path to its corrected output JSON. The output is re-run through the same Ajv envelope gate, phase-specific legacy validator (audit / feature / amend), and ownership check that `swarm collect` uses. The verb does not have its own validator; the gate is the same gate.
- On pass per domain, the verb calls `transitionAgent(db, agent_run_id, 'complete', reason, /* override */ true)`. The override branch is required because the source statuses (`invalid_output`, `ownership_violation`) are in `BLOCKED_STATUSES`; the canonical `canTransition` law would otherwise refuse them.
- After every latest agent_run on the wave is `complete`, if the wave is currently `failed`, the same transaction calls `transitionWave(db, wave.id, 'collected', 'revalidate: <reason>', true)`. The wave-level override is required because `failed` is a BLOCKED wave status.
- Partial repair (some agents repaired, others still BLOCKED) keeps the wave in `failed` deliberately. The applied-summary spells out the “Wave NOT recovered” clause with a count of still-blocked agents so the operator’s natural read of “Repaired: N” + clean exit code cannot be misread as “wave fully recovered.”
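The wave-flip rule and the partial-repair summary described above can be sketched as a pure function. The function name, its argument shape, and the exact summary wording are assumptions for illustration; only the "Repaired: N" and "Wave NOT recovered" phrases come from the text.

```javascript
// Illustrative sketch only; not the logic in commands/revalidate.js.
// `latestAgentStatuses` holds the status of each domain's latest agent_run
// AFTER the per-domain repairs have been applied.
function summarizeRevalidate({ waveStatus, latestAgentStatuses, repairedCount }) {
  const stillBlocked = latestAgentStatuses.filter((s) => s !== 'complete').length;

  // The wave flips only when every latest agent_run is complete AND the
  // wave is currently failed; otherwise its status is left alone.
  const waveFlips = stillBlocked === 0 && waveStatus === 'failed';

  const lines = [`Repaired: ${repairedCount}`];
  if (waveFlips) {
    lines.push('Wave recovered: failed -> collected');
  } else if (stillBlocked > 0) {
    // Spell the partial case out so a clean exit code is not misread.
    lines.push(`Wave NOT recovered: ${stillBlocked} agent_run(s) still blocked`);
  }

  return {
    waveFlips,
    nextWaveStatus: waveFlips ? 'collected' : waveStatus,
    summary: lines.join('\n'),
  };
}
```

Keeping the decision separate from the database writes makes the partial-repair case easy to test without a live control plane.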
swarm rewind
Restore the working tree to a named save-point AND lawfully tear down any orphaned in-flight rows that pre-date the save-point.

Usage: `swarm rewind <save-point-tag> --reason "<text>" [--apply] [--force] [--force-arbitrary-ref]`

Behavior (per `packages/dogfood-swarm/commands/rewind.js`):
- `<save-point-tag>` must match `swarm-save-*` by default; `--force-arbitrary-ref` opts into arbitrary refs (tags, branches, commits). Destructive verb; the conservative default lives on the safer surface.
- `--reason "<text>"` is required and non-empty. The text is prefixed with `rewind:` and recorded in every `wave_state_events` and `agent_state_events` row this verb writes.
- Uncommitted changes in the working tree are refused without `--force`. This mirrors the documented destructive surface of `git reset --hard`: silently discarding uncommitted work would be the worst kind of operator surprise.
- Order of operations: validate → `git reset --hard <tag>` → DB transaction. Git cannot roll back inside a SQL transaction; the reverse order would leave the DB rows committed as aborted while the tree was never reset if SQL succeeded and git failed. A partial DB-side failure surfaces as a `state_split: true` error in the report.
- Affected rows transition to the new `aborted_for_rewind` terminal status (parallel agent + wave statuses; introduced for this verb). Reusing `failed` would be semantically wrong (the wave was not a logic failure; it was operator-aborted); a same-status no-op write would erase the lifecycle signal entirely.
- Terminal rows (advanced waves, complete agent_runs, prior `aborted_for_rewind` entries) survive byte-identical. The plan summary names the preserved count alongside the affected count: “rewind erases the failure tail but preserves history” is the operator’s mental model and the surface reflects it.
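The order of operations above (validate, then git, then the DB transaction) can be sketched with the side effects injected as callbacks. Everything here is an assumption made for illustration: the `rewind` function signature, the `treeIsDirty` / `gitResetHard` / `runDbTransaction` callbacks, and every report field other than `state_split`. The sketch covers only the `--apply` path.

```javascript
// Illustrative sketch only; not the code in commands/rewind.js.
const SAVE_POINT_PATTERN = /^swarm-save-.+/;

function rewind({
  tag,
  reason,
  force = false,
  forceArbitraryRef = false,
  treeIsDirty,      // () => boolean, hypothetical working-tree check
  gitResetHard,     // (tag) => void, hypothetical wrapper around git reset --hard
  runDbTransaction, // (prefixedReason) => void, hypothetical single-transaction writer
}) {
  // 1. Validate everything before touching the tree or the database.
  if (typeof reason !== 'string' || reason.trim() === '') {
    throw new Error('rewind: --reason "<text>" is required and must be non-empty');
  }
  if (!forceArbitraryRef && !SAVE_POINT_PATTERN.test(tag)) {
    throw new Error(`rewind: ${tag} is not a swarm-save-* tag (use --force-arbitrary-ref)`);
  }
  if (!force && treeIsDirty()) {
    throw new Error('rewind: uncommitted changes present (use --force)');
  }

  // 2. Git first. If this throws, the database has not been touched.
  gitResetHard(tag);

  // 3. DB transaction second. A failure here cannot undo the reset, so it
  //    is reported as a split state instead of being swallowed.
  try {
    runDbTransaction(`rewind: ${reason}`);
  } catch (err) {
    return { ok: false, state_split: true, cause: err.message };
  }
  return { ok: true, state_split: false };
}
```

With this ordering, the only inconsistent outcome is "tree reset, rows not yet aborted", and that outcome is always reported rather than left silent.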
swarm redrive
Resume an in-flight wave at the same wave_id. Completed work survives byte-identical; only the failure tail is re-dispatched. These are Step Functions Redrive semantics, applied to the swarm control plane.

Usage: `swarm redrive <wave-id> --reason "<text>" [--apply]`

Behavior (per `packages/dogfood-swarm/commands/redrive.js`):
- `<wave-id>` must be a positive integer matching `waves.id`.
- `--reason "<text>"` is required and non-empty; the text is prefixed with `redrive:` and lands on every `_state_events` row this verb writes.
- Wave-level eligibility: `advanced` and `aborted_for_rewind` are TERMINAL and refused (promotion is immutable; an `aborted_for_rewind` wave is run-a-fresh-wave territory). `collected`, `verified`, `dispatched` use the normal `transitionWave` path; `failed` uses the override branch because it is BLOCKED.
- Per-agent_run eligibility table:
| Source status | Outcome | Reason |
|---|---|---|
| `complete` | PRESERVED | Receipt is immutable; appears in the report as informational |
| `pending`, `dispatched` | ELIGIBLE | Redriven to `dispatched` (source == target → audit row only) |
| `failed` | ELIGIBLE | Redriven via override (BLOCKED source) |
| `timed_out` | ELIGIBLE | Normal path; `timed_out` → `dispatched` edge already exists |
| `invalid_output` | REFUSED | Use `swarm revalidate` instead (wrong verb) |
| `ownership_violation` | REFUSED | Operator unblocks ownership first; not a redrive case |
| `aborted_for_rewind` | REFUSED | Terminal; run a fresh wave at the same phase |
| `running` | REFUSED | Let the timeout policy fire, then redrive the resulting `timed_out` |
- Receipt-byte-identity contract. Before `--apply` the verb computes a sha256 hash over the identity-carrying fields (status, output_path, completed_at) plus the full `agent_state_events` chain for every `complete` agent_run on the wave. After `--apply` it recomputes the same hash and asserts equality. A future regression that accidentally writes to a complete row trips this gate and the verb throws.
- `serial_verify_required` on the wave is preserved across redrive. The flag marks operator discipline (“this wave was dispatched with `--skip-verify`; coordinator owes one serial `npm run verify` against the cumulative tree”); resetting it would silently re-arm the bug the flag exists to prevent.
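The per-agent_run eligibility table and the receipt-byte-identity gate can be sketched together. The lookup mirrors the table above; the hashing helper is a guess at one reasonable serialization, since the text names the hashed fields but not their encoding. The function names, the event field names (`from`, `to`, `reason`, `at`), and the `Map` of events keyed by run id are all hypothetical.

```javascript
// Illustrative sketch only; not the code in commands/redrive.js.
const crypto = require('node:crypto');

// Mirrors the per-agent_run eligibility table.
const AGENT_OUTCOME = new Map([
  ['complete', 'PRESERVED'],
  ['pending', 'ELIGIBLE'],
  ['dispatched', 'ELIGIBLE'],
  ['failed', 'ELIGIBLE'],
  ['timed_out', 'ELIGIBLE'],
  ['invalid_output', 'REFUSED'],
  ['ownership_violation', 'REFUSED'],
  ['aborted_for_rewind', 'REFUSED'],
  ['running', 'REFUSED'],
]);

function classify(status) {
  const outcome = AGENT_OUTCOME.get(status);
  if (!outcome) throw new Error(`unknown agent_run status: ${status}`);
  return outcome;
}

// sha256 over the identity-carrying fields plus the event chain of every
// complete agent_run. The JSON encoding is an assumption of this sketch.
function receiptHash(agentRuns, eventsByRunId) {
  const hash = crypto.createHash('sha256');
  const completed = agentRuns
    .filter((run) => run.status === 'complete')
    .sort((a, b) => a.id - b.id); // stable order so the hash is reproducible
  for (const run of completed) {
    hash.update(JSON.stringify([run.id, run.status, run.output_path, run.completed_at]));
    for (const ev of eventsByRunId.get(run.id) ?? []) {
      hash.update(JSON.stringify([ev.from, ev.to, ev.reason, ev.at]));
    }
  }
  return hash.digest('hex');
}

// Hash before, apply, hash after, assert equality. In a real transaction a
// throw at this point would also roll the writes back.
function withReceiptGuard(agentRuns, eventsByRunId, applyFn) {
  const before = receiptHash(agentRuns, eventsByRunId);
  applyFn();
  const after = receiptHash(agentRuns, eventsByRunId);
  if (before !== after) {
    throw new Error('redrive: a completed receipt changed during apply');
  }
}
```

Re-dispatching a `failed` run passes the guard because that run never contributes to the hash; touching any hashed field or event of a `complete` run makes it throw.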
Example session
A walked example, end-to-end. The repo has a dispatched wave (id 42) that finished with two agents reporting `invalid_output`. The save-point tag is `swarm-save-1789450000`.

```shell
# Step 1: inspect the wave's state.
swarm status <run-id>
# Output names blocked agents and points at swarm revalidate.

# Step 2: try revalidate first (dry-run).
swarm revalidate <run-id> \
  --reason "fix typos in output JSON" \
  --domain=backend:outputs/backend.json \
  --domain=tests:outputs/tests.json
# Dry-run renders the plan; if every agent passes the
# validators and would transition cleanly, --apply commits.

# Step 3: revalidate apply.
swarm revalidate <run-id> \
  --reason "fix typos in output JSON" \
  --domain=backend:outputs/backend.json \
  --domain=tests:outputs/tests.json \
  --apply
# Wave flips from failed to collected in the same transaction.

# Alternate path: wave unsalvageable; rewind erases.
swarm rewind swarm-save-1789450000 \
  --reason "slice abandoned; wrong architectural direction"
# Dry-run. Inspect, then re-run with --apply.
swarm rewind swarm-save-1789450000 \
  --reason "slice abandoned; wrong architectural direction" \
  --apply
# Tree reset to save-point; in-flight rows aborted_for_rewind.

# Alternate: wave right but two agents failed; redrive resumes.
swarm redrive 42 \
  --reason "transient net failures on 2 agents; re-dispatch tail"
# Dry-run names what's preserved vs eligible vs refused.
swarm redrive 42 \
  --reason "transient net failures on 2 agents; re-dispatch tail" \
  --apply
# Completed receipts unchanged; failure tail back at dispatched.

# In every case, audit the chain afterwards.
swarm history 42
```

The audit chain is the canonical record. Direct DB intervention is universally last-resort across the industry (pg_resetwal, etcdctl snapshot restore, Modern Treasury, Stripe Ledger): every one of them mediates state mutation through tooled commands that emit an event-sourced audit row alongside the UPDATE. The Three R’s are this repo’s expression of that pattern.
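Because every verb prefixes its reason, audit rows can be bucketed by intent once they have been read out of `wave_state_events` / `agent_state_events`. The sketch below assumes each row is a plain object with a `reason` string; that row shape and the function names are hypothetical, and how the rows are fetched is left out.

```javascript
// Illustrative sketch only; the row shape is an assumption.
const VERB_PREFIX = /^(revalidate|rewind|redrive):/;

// Returns the recovery verb that wrote a row, or null for ordinary rows.
function verbOf(reason) {
  const match = VERB_PREFIX.exec(reason ?? '');
  return match ? match[1] : null;
}

// Buckets audit rows by the verb named in their reason prefix.
function groupByVerb(rows) {
  const groups = { revalidate: [], rewind: [], redrive: [], other: [] };
  for (const row of rows) {
    groups[verbOf(row.reason) ?? 'other'].push(row);
  }
  return groups;
}
```

A row whose reason does not start with one of the three prefixes lands in `other`, which keeps ordinary lifecycle transitions separate from operator-driven recovery.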
Cross-references
- Wave transition history inspection: `swarm history <wave-id>`
- Agent run lifecycle and the BLOCKED override primitive: State Machines
- CLI error codes surfaced by these verbs: Error Code Reference