Core concepts
This page covers the vocabulary the rest of the handbook assumes: who plays, how puzzles move through the catalog, how reliability is measured, when a puzzle is “real,” how an attempt is scored, and how ai-crucible measures its own prompt sensitivity.
The five roles
Section titled “The five roles”Every participant implements one uniform interface (name + an async act), and two invariants are
structural, not aspirational: all model I/O routes through a single injected choke point, and no role
may read the competitive chrome.
| Role | What it does | Tools? |
|---|---|---|
| Designer | Generates and refines puzzle candidates aimed at documented capability gaps. The creative role. | None — pure model I/O. |
| Solver | Attempts the puzzle inside the sandbox, under a kernel-enforced budget. | The only role granted sandbox tools (exec / read_file / write_file). |
| Critic | An optional adversarial third agent (debate-style). Interface-complete but default-off. | None — string-in/string-out, no environment access. |
| Judge | One panelist in a cross-family panel; scores the terminal world-state, not the chat text. | None — string-in/string-out. |
| CohortSolver | A sibling Solver in a pass^k cohort, so the kernel can wire k of them per puzzle. | Same tool grant as Solver. |
The Critic is reserved but off by default: its measured value is narrow — it helps mainly on hard, ambiguous items — while it costs dramatically more tokens and can actually regress a strong Solver through conformity (the Solver capitulating to a confident-but-wrong critique). Its message schema is defined up front so it can be enabled later on high-disagreement puzzle classes without breaking the contract.
The catalog lifecycle: Lab → Arena → Regression
Section titled “The catalog lifecycle: Lab → Arena → Regression”Puzzles live in a three-tier catalog and move in one direction:
- Lab — live and in-iteration. A Designer and Solver are actively working a candidate; its difficulty is still being characterized.
- Arena — graduated and active. A Lab puzzle is promoted once it clears the graduation rule and is validated as fair/difficult by input from outside the Solver’s own model family.
- Regression — historical. When the frontier moves and a puzzle becomes reliably solvable, it is demoted to Regression, never deleted, and must keep passing forever.
Demoting rather than deleting is deliberate: the catalog becomes a capability-evolution timeline. When a stronger model ships, you demote what it consistently solves and run a fresh “find what it still fails on” Lab cycle, so the catalog stays ahead of the frontier by construction.
Reliability: pass^k, not pass@k
Section titled “Reliability: pass^k, not pass@k”AI Crucible reports consistency, not best-of-k. pass^k is the probability that all k
independent trials succeed — it decays exponentially and exposes unreliability that pass@k (which
rewards a single lucky success out of k) hides. This matters because frontier failures are sticky:
a large fraction of a model’s failures on agentic tasks recur identically across re-runs, so a metric
that hides that recurrence overstates real-world reliability.
AI Crucible records pass^k natively — k sibling attempts collected under one puzzle history, rather
than k samples folded into one log — so the consistency view is first-class.
Wilson intervals at small N
Section titled “Wilson intervals at small N”AI Crucible grades a handful of attempts per puzzle, never hundreds. At that scale the classical
normal-approximation (Wald) confidence interval is inadmissible — it under-states uncertainty and
degenerates at the 0-of-N and N-of-N boundaries. AI Crucible uses the Wilson score interval (and the
conservative Clopper–Pearson exact interval when it must not understate uncertainty), both of
which stay sane and informative at small N and never escape [0, 1].
The graduation rule
Section titled “The graduation rule”A Lab puzzle graduates to Arena only when its Wilson 95% interval shows it is neither trivial nor impossible — both ruled out in a single test:
0.10 ≤ wilson_lower AND wilson_upper ≤ 0.90- The lower-bound clause kills impossible puzzles: a 0-of-20 puzzle has a low Wilson lower bound — no evidence anyone can clear even 10%.
- The upper-bound clause kills trivial puzzles: a 20-of-20 puzzle has a high Wilson upper bound — no evidence it ever stumps anyone.
A mid-rate puzzle (around 5-of-20) satisfies both and graduates. Using Wilson rather than Wald here is load-bearing precisely because the interesting candidates sit near the 0/N and N/N boundaries where the normal interval breaks down.
The scoring gate: conjunctive, with a tiebreaker
Section titled “The scoring gate: conjunctive, with a tiebreaker”A weighted sum of “solve + elegance + novelty − penalties” is Goodhart-fragile: a capable agent can trade a little true performance for a lot of proxy reward. So ai-crucible separates the pass/fail decision from the ranking number.
The hard gate is a conjunction — it opens only when all of these hold:
- The task oracle is satisfied (
solved) and nothing required-to-still-work broke (no_regression) — the solved-and-no-regression pattern. - The graded solve quality meets the puzzle’s
point_threshold. - No critical (adversarial-flavor) penalty fired — e.g. fetching the answer key or tampering with the verifier. A single critical penalty closes the gate even on an otherwise perfect solve.
- Tool calls stayed within budget.
- Wall-clock time stayed within budget.
- If a novelty bonus was claimed, the cross-family panel validated it — you don’t get to assert your own bonus.
Only within that passing region does the net score (solve + elegance + novelty − penalties)
matter, and it is used for leaderboards, never to open the gate. Two design guards keep any single
axis from becoming a gaming magnet: elegance is scored as a ratio (canonical-call-count over
calls-actually-used, so hard puzzles aren’t penalized against trivial ones) and is capped at 30% of
the solve reward, while the novelty bonus is capped at 50% and applies only when panel-validated.
A failing attempt scores 0.0, but the full breakdown of which conditions failed always rides in
the score metadata, so a failure is legible rather than a bare zero.
Framing as a measured arm
Section titled “Framing as a measured arm”Prompt framing is not a fixed choice in ai-crucible — it is a first-class measured variable. The same puzzle can be run under three arms, and ai-crucible characterizes its own prompt-effect as a built-in diagnostic:
neutral— the task and legitimate task feedback only.self_referential(default) — a “beat your own previous best” mastery framing, applied deployment-plausibly. (Self-referenced mastery goals reliably out-perform social-comparison goals, which is why this is the default.)social_standings— the old peer-standings/leaderboard framing, retained only as a measured arm and rendered as chrome, never as the default scored context.
Running a puzzle across arms turns “is this model sensitive to competitive framing?” into a measured number instead of a guess — and the same machinery doubles as the boundary-validation probe described in the Security model.