Operating Guide
Weekly: Freshness Review
- Run the portfolio generator: `node packages/portfolio/generate.js`
- Check `reports/dogfood-portfolio.json` and inspect the `stale` array
- Repos with `freshness_days > 14` get a warning flag
- Repos with `freshness_days > 30` are in violation: re-run the scenario or document the block
- Inspect the `unknown_freshness` array. Entries here have unparseable `record.timing.finished_at` timestamps (`computeFreshnessDays` returned `null`, route added by F-246817-005). Without this step the entries silently bypass the freshness review forever. For each entry, identify the source repo from `repo`, fix the submission emitter to produce a well-formed ISO 8601 timestamp, and re-dispatch.
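The thresholds above can be sketched as a small classifier. This is a minimal illustration, not the portfolio generator's actual code: `freshnessDaysSketch` and `classifyFreshness` are hypothetical names, and the only details taken from this guide are the 14-day and 30-day thresholds and the fact that an unparseable timestamp yields `null`.

```javascript
// Hypothetical sketch; the real computeFreshnessDays lives in testing-os and may differ.
const MS_PER_DAY = 86_400_000;

// Returns whole days since finishedAt, or null when the timestamp cannot be parsed.
// Note: Date.parse is lenient, so this is not a strict ISO 8601 validator.
function freshnessDaysSketch(finishedAt, now = new Date()) {
  if (typeof finishedAt !== "string") return null;
  const finishedMs = Date.parse(finishedAt);
  if (Number.isNaN(finishedMs)) return null;
  return Math.floor((now.getTime() - finishedMs) / MS_PER_DAY);
}

// Maps a freshness value onto the buckets described in the weekly review.
function classifyFreshness(days) {
  if (days === null) return "unknown"; // would land in unknown_freshness
  if (days > 30) return "violation";
  if (days > 14) return "warning";
  return "fresh";
}

// Example run against a fixed "now" so the result does not depend on the clock.
const reviewDate = new Date("2026-04-29T00:00:00Z");
const samples = ["2026-04-25T00:00:00Z", "2026-04-01T00:00:00Z", "2026-03-01T00:00:00Z", "not-a-timestamp"];
for (const ts of samples) {
  console.log(ts, "->", classifyFreshness(freshnessDaysSketch(ts, reviewDate)));
}
```

Keeping the `null` case as its own bucket, rather than coercing it to zero, is what prevents an unparseable timestamp from looking fresh.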
This page documents the record classification + portfolio bucket state machine. For the finding-review state machine (candidate → reviewed → accepted), see Intelligence Layer. For the wave-classification state machine (new/recurring/fixed/unverified), see the State Machines reference.
Monthly: Policy Calibration
- Review all `warn-only` and `exempt` repos for promotion to `required`
- Check `review_after` dates: past-due repos must be evaluated
- Promotion criteria: repo has passed dogfood at least twice on required-equivalent scenarios
- If a repo can't promote, document why and set a new `review_after` date
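The monthly check for past-due reviews can be sketched as a filter. This is an illustration under stated assumptions: the flat `{ repo, mode, review_after }` objects are a hypothetical summary of each policy (the real policy YAML layout may nest these fields differently), and `pastDueReviews` is not part of testing-os.

```javascript
// Hypothetical policy summaries; the real policy YAML layout may differ.
const PROMOTABLE_MODES = new Set(["warn-only", "exempt"]);

// Returns repos in a promotable mode whose review_after date (YYYY-MM-DD) is before `today`.
// Zero-padded ISO dates compare correctly as plain strings.
function pastDueReviews(policies, today) {
  return policies
    .filter((p) => PROMOTABLE_MODES.has(p.mode))
    .filter((p) => typeof p.review_after === "string" && p.review_after < today)
    .map((p) => p.repo);
}

const samplePolicies = [
  { repo: "example-org/alpha", mode: "warn-only", review_after: "2026-03-01" },
  { repo: "example-org/beta", mode: "exempt", review_after: "2026-12-01" },
  { repo: "example-org/gamma", mode: "required" },
];

console.log(pastDueReviews(samplePolicies, "2026-04-29"));
```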
On Failure
- Investigate root cause: is it the scenario, the repo, or the infrastructure?
- Fix the scenario or repo, not the governance system
- Update rollout doctrine only if the failure reveals a genuinely new seam
- Never weaken enforcement to make a failure go away
New Repo Onboarding
- Create a policy YAML in `policies/repos/<org>/<repo>.yaml` (where `<org>` is `dogfood-lab` or `mcp-tool-shop-org`) with `enforcement.mode: required`
- Identify the correct surface type from the 8 defined surfaces: cli, desktop, web, api, mcp-server, npm-package, plugin, library
- Define required scenarios and freshness thresholds in the policy under `surfaces.<surface>`
- In the source repo, create `dogfood/scenarios/<scenario-id>.yaml` following the scenario contract
- Create a dogfood workflow in the source repo (`.github/workflows/dogfood.yml`) that builds a submission and dispatches to testing-os
- The source workflow should use the submission builder (`packages/report/build-submission.js`) to produce a canonical submission
- Add the `DOGFOOD_TOKEN` secret to the consumer repo. It is required for the dispatch step. Mint a fine-grained PAT (or GitHub App token) with `contents: write` scoped to `dogfood-lab/testing-os`, then add it under the consumer repo's Settings → Secrets and variables → Actions as `DOGFOOD_TOKEN`. Without this, the workflow runs green but skips dispatch with a `DOGFOOD_TOKEN not set` warning and no record reaches testing-os. See GitHub docs on fine-grained PATs.
- Run the workflow, verify ingestion produces an accepted record, confirm the repo appears in `indexes/latest-by-repo.json`
- Run `npx @mcptoolshop/shipcheck dogfood --repo <org>/<repo> --surface <surface>` on the source repo to confirm Gate F passes (the `dogfood` subcommand is the freshness/Gate F check; `audit` is the SHIP_GATE.md tracker for hard gates A–D)
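A pre-flight check for the first few onboarding steps might look like the sketch below. The surface list and the policy path convention come from this guide; `policyPath` and `checkOnboardingInput` are hypothetical helpers, not part of testing-os or shipcheck.

```javascript
// The eight surfaces listed in the onboarding steps.
const SURFACES = ["cli", "desktop", "web", "api", "mcp-server", "npm-package", "plugin", "library"];

// Hypothetical helper: builds the policy path from the convention in the first onboarding step.
function policyPath(org, repo) {
  return `policies/repos/${org}/${repo}.yaml`;
}

// Hypothetical pre-flight check to run before writing a policy file.
function checkOnboardingInput({ org, repo, surface }) {
  const problems = [];
  if (!org || !repo) problems.push("org and repo are both required");
  if (!SURFACES.includes(surface)) problems.push(`unknown surface: ${surface}`);
  return { path: policyPath(org, repo), problems };
}

console.log(checkOnboardingInput({ org: "dogfood-lab", repo: "testing-os", surface: "cli" }));
console.log(checkOnboardingInput({ org: "dogfood-lab", repo: "testing-os", surface: "gui" }));
```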
Filesystem Requirements
testing-os assumes POSIX link(2) semantics for atomic publication in the file-lock CAS (packages/findings/lib/file-lock.js, shipped in v1.1.5). All major dev/CI filesystems support this:
| Filesystem | Hardlinks | Status |
|---|---|---|
| APFS (macOS) | yes | supported |
| HFS+ (legacy macOS) | yes | supported |
| ext4 (Linux) | yes | supported (CI baseline) |
| NTFS (Windows) | yes | supported |
| exFAT | no | NOT supported — linkSync throws ENOTSUP |
| FAT32 | no | not supported |
The failure mode on exFAT is loud: callers see ENOTSUP: operation not supported on socket, link ... from findings/lib/file-lock.js:atomicCreateLock. Production code paths that write to a review log on exFAT will throw at the user rather than silently degrade.
Common operator gotcha: cross-platform external SSDs (e.g., Samsung T7/T9, SanDisk Extreme) are typically formatted exFAT for Windows + macOS interop. Clone testing-os to local APFS/HFS+/ext4 instead. The Session G validation matrix at docs/m5-validation-2026-04-29.md walks through the full APFS-vs-exFAT comparison.
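To see why hardlink support matters, here is a minimal sketch of link-based atomic creation. It is an assumption-laden illustration, not the real `atomicCreateLock`: `tryCreateLock` is a hypothetical name, and the temp-file naming is invented for the example. The technique relies only on `fs.linkSync` failing with `EEXIST` when the target already exists.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical sketch of link-based atomic creation; not the real atomicCreateLock.
// Returns true if this call created lockPath, false if it already existed.
function tryCreateLock(lockPath, payload) {
  const tmpPath = `${lockPath}.${process.pid}.${Date.now()}.tmp`;
  fs.writeFileSync(tmpPath, payload);
  try {
    // link(2) fails with EEXIST if lockPath exists, so only one caller can win.
    fs.linkSync(tmpPath, lockPath);
    return true;
  } catch (err) {
    if (err.code === "EEXIST") return false;
    throw err; // e.g. ENOTSUP on exFAT: surface it loudly instead of degrading
  } finally {
    fs.unlinkSync(tmpPath); // the lock file keeps the content via its own link
  }
}

// Example: the second attempt loses because the lock file already exists.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-sketch-"));
const demoLock = path.join(demoDir, "review.lock");
console.log(tryCreateLock(demoLock, "holder-1"));
console.log(tryCreateLock(demoLock, "holder-2"));
fs.rmSync(demoDir, { recursive: true, force: true });
```

On a filesystem without hardlinks the `linkSync` call throws before any lock is published, which matches the loud failure mode described above.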
Running Ingestion Locally
The ingestion CLI (`packages/ingest/run.js`) requires an explicit `--provenance` flag:

    # Production (in CI) -- verifies source runs via GitHub API
    node packages/ingest/run.js --file submission.json --provenance=github

    # Local development / testing -- uses a stub that always confirms
    node packages/ingest/run.js --file submission.json --provenance=stub

The `--provenance=stub` flag is blocked in CI environments (`CI=true` or `GITHUB_ACTIONS=true`) as a safety measure. In CI without an explicit flag, the ingestion pipeline defaults to GitHub provenance and requires `GITHUB_TOKEN`.
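The provenance rules can be read as a small decision function. The sketch below is a hypothetical restatement, not the code in `packages/ingest/run.js`: `resolveProvenance` and its error messages are invented, and requiring `GITHUB_TOKEN` for an explicit `--provenance=github` outside CI is an assumption.

```javascript
// Hypothetical sketch of the provenance rules; not the code in packages/ingest/run.js.
function resolveProvenance(flag, env) {
  const inCi = env.CI === "true" || env.GITHUB_ACTIONS === "true";

  if (flag === "stub") {
    // The stub always confirms, so it must never run where results count.
    if (inCi) throw new Error("--provenance=stub is blocked in CI");
    return "stub";
  }

  if (flag === "github" || (flag === undefined && inCi)) {
    // Assumption: the token is needed whenever GitHub provenance is selected.
    if (!env.GITHUB_TOKEN) throw new Error("GitHub provenance requires GITHUB_TOKEN");
    return "github";
  }

  if (flag === undefined) throw new Error("an explicit --provenance flag is required");
  throw new Error(`unknown provenance: ${flag}`);
}

console.log(resolveProvenance("stub", {}));
console.log(resolveProvenance(undefined, { GITHUB_ACTIONS: "true", GITHUB_TOKEN: "example" }));
```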
For the full per-verb reference of every swarm command (init / domains / dispatch / collect / verify / advance / status / revalidate / rewind / redrive / history and the other 10 verbs), see the swarm CLI reference. The reference is organised by verb in the order an operator typically reaches them — start with the verbs documented in this Operating Guide, then consult the reference for one-line synopses of the rest.
CDN Cache Timing
raw.githubusercontent.com caches for 3-5 minutes. After a fresh ingestion, Gate F may read stale data. This is operational, not a product defect. Wait 3-5 minutes and retry.
The handbook itself is served via GitHub Pages, which is also CDN-backed — handbook edits typically take a few minutes to propagate after pages.yml deploys.
Corrupted Record Recovery
`packages/ingest/rebuild-indexes.js` returns `{ accepted, rejected, corrupted, skipped }`. The `corrupted[]` array carries `{ path, error }` for any record whose JSON could not be parsed. The rebuild does not fail on corruption: it skips the record, logs `[rebuild-indexes] corrupted record skipped: <path> — <error>` to stderr, and continues. The skipped record is excluded from `latest-by-repo.json`, so the index is silently incomplete until repaired.
Recovery procedure:
- Identify corrupted records, either from the `corrupted[]` return array or by grepping the rebuild stderr for `corrupted record skipped`.
- For each `corrupted[].path`:
  - Repair the JSON if the cause is obvious (truncation, encoding bleed) and re-run `node packages/ingest/rebuild-indexes.js`.
  - Or, if the record cannot be salvaged: read the `run_id` from the path, re-dispatch the source workflow to produce a clean record, then delete the corrupted file and rebuild.
- Verify the record now appears in `indexes/latest-by-repo.json`.
The same rebuild-indexes call also returns skipped[] for records that loaded but lacked a run_id — same recovery shape (re-dispatch with a complete submission), different root cause.
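The recovery procedure can be turned into a worklist from the rebuild result. This is a sketch under stated assumptions: `recoveryWorklist` is a hypothetical helper, the `{ path, error }` shape of `corrupted[]` is from this guide, and the assumption that `skipped[]` entries also carry a `path` is not confirmed by it.

```javascript
// Hypothetical helper: turns a rebuild result into a list of operator actions.
// The { path, error } shape of corrupted[] is from this guide; the shape of
// skipped[] entries (assumed here to carry a path) is an assumption.
function recoveryWorklist(result) {
  const corrupted = (result.corrupted ?? []).map((c) => ({
    path: c.path,
    reason: c.error,
    action: "repair JSON or re-dispatch, then rebuild",
  }));
  const skipped = (result.skipped ?? []).map((s) => ({
    path: s.path,
    reason: "missing run_id",
    action: "re-dispatch with a complete submission",
  }));
  return [...corrupted, ...skipped];
}

// Example with an invented rebuild result.
const sampleResult = {
  accepted: 12,
  rejected: 1,
  corrupted: [{ path: "records/example/a.json", error: "Unexpected end of JSON input" }],
  skipped: [{ path: "records/example/b.json" }],
};
for (const item of recoveryWorklist(sampleResult)) {
  console.log(`${item.path}: ${item.action} (${item.reason})`);
}
```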
Recovering from a locked control-plane DB
The swarm control plane is a single SQLite file at `swarms/control-plane.db`. Every file-backed connection opens in WAL journal mode with a 5-second `busy_timeout` (`BUSY_TIMEOUT_MS` in `packages/dogfood-swarm/db/connection.js`). That combination is the design for parallel swarm invocations:
- Multiple readers (`swarm status`, `swarm runs`) + one writer: safe under WAL, no waiting.
- Two concurrent writers (e.g. `swarm collect` while another process runs `swarm advance`): each writer briefly waits up to 5s for the other to release its transaction, then proceeds. Swarm transactions are small (most under 50ms), so writer-vs-writer contention normally self-resolves well inside the window and you never see an error.
When a lock does surface, it appears as a SQLITE_BUSY / “database is locked” error (and, from swarm collect, as the COLLECT_UPSERT_FAILED code — see the Error Code Reference). This means a writer held the DB for longer than 5s. The cause is almost always another process still holding a write transaction, not file corruption — so the fix is to release that process, not to touch the DB file.
Recovery procedure:
- Find the stuck process. List running `swarm` processes and look for one that is hung mid-command (a crashed agent that never released its transaction, an orphaned `swarm collect` / `swarm advance`, or an editor/`sqlite3` shell you left open on the file):
  - macOS/Linux: `ps aux | grep -i swarm` (and `lsof swarms/control-plane.db` to see every process holding the file open).
  - Windows (PowerShell): `Get-Process | Where-Object { $_.Path -like '*node*' }`, or use Resource Monitor's "Associated Handles" search for `control-plane.db`.
- Stop it. Terminate the stuck process so it releases its lock (`kill <pid>` / `Stop-Process -Id <pid>`). A clean `swarm` process will release on its own once its transaction commits; only kill one that is genuinely hung.
- Re-run the command that hit the lock. With the holder gone, the retry acquires the write lock immediately. `swarm collect` in particular is idempotent at the upsert level (per `COLLECT_UPSERT_FAILED`), so re-running it after clearing the lock is safe.
Do NOT delete swarms/control-plane.db, run PRAGMA-level surgery, or hand-edit the WAL/SHM sidecar files to “unstick” a lock — the lock is held by a live process, and removing the file discards committed swarm state. If a lock persists after every swarm process is confirmed dead, the WAL sidecar (control-plane.db-wal) may simply need a clean checkpoint, which the next normal openDb performs automatically; re-running any swarm verb is the recovery, not file deletion. If contention is chronic (you regularly wait >5s), raise BUSY_TIMEOUT_MS rather than serialising externally.
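A wrapper script that wants to point operators at this procedure first has to recognise a lock error. The sketch below is one hedged way to do that: `isLockError` and `nextStep` are hypothetical, and the assumption that errors expose a `code` starting with `SQLITE_BUSY` or a "database is locked" message depends on the SQLite binding in use, which this guide does not specify.

```javascript
// Hypothetical helper; the error shapes checked here are assumptions about the
// SQLite binding, not a documented contract of the swarm CLI.
function isLockError(err) {
  if (!err) return false;
  const code = String(err.code ?? "");
  const message = String(err.message ?? "");
  return code.startsWith("SQLITE_BUSY") || /database is locked/i.test(message);
}

// Maps an error onto the guidance in this section.
function nextStep(err) {
  return isLockError(err)
    ? "find and stop the stuck process, then re-run the command; do not touch the DB file"
    : "not a lock error; consult the Error Code Reference";
}

console.log(nextStep({ code: "SQLITE_BUSY", message: "database is locked" }));
console.log(nextStep(new Error("record failed schema validation")));
```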
Error Codes
For the structured error codes that surface from the ingest and dogfood-swarm CLIs (`RECORD_SCHEMA_INVALID`, `DUPLICATE_RUN_ID`, `ISOLATION_FAILED`, `COLLECT_UPSERT_FAILED`, `CONTROL_PLANE_SCHEMA_TOO_NEW`, `STATE_MACHINE_*`), see the Error Code Reference.
Rollout Doctrine
10 rules learned from real failures during expansion:
- Surface truth — the scenario must match the real product surface
- Build output truth — verify the actual build artifact, not just source
- Protocol truth — use the real protocol the product exposes
- Runtime truth — exercise in the real runtime environment
- Process truth — test the actual process lifecycle
- Dispatch truth — verify the dispatch mechanism works end-to-end
- Concurrency truth — handle concurrent ingestion gracefully
- Verdict truth — source proposes, verifier confirms or downgrades
- Evidence truth — evidence must be machine-verifiable
- Entrypoint truth — use the real CLI interface, not assumed flags