Diagnostic adversarial game

ai-crucible a measurement instrument that happens to be fun.

One Claude session crafts puzzles targeting real, currently-observed capability gaps; another attempts them. A policy-enforced kernel scores against a hidden oracle and curates a Lab → Arena → Regression catalog — rewarding elegance and novelty, penalizing answer-bypass.

View on GitHub Read the Handbook

Clone

git clone https://github.com/dogfood-lab/ai-crucible

Install

uv sync --extra dev --extra stats

Verify

bash verify.sh

What makes it different

A diagnostic instrument, not a leaderboard.

Capability, not "cheating"

Distinguishes elegance and novelty (rewarded) from answer-bypass (penalized). Lateral thinking is a capability to measure, not a vice to punish.

The instrument measures itself

Prompt framing is a first-class measured arm — the kernel runs the same puzzle under neutral / self-referential / social framings and reports its own prompt-effect.

A sealed boundary

Motivation and measurement never share a context window; the hidden oracle is graded out-of-band by a different model family with the agent’s reasoning hidden.

Reliability by consistency

pass^k (all k trials succeed), Wilson intervals, and cross-family judge panels — distributions, not point estimates.

Quick start

Set up

uv sync --extra dev --extra stats

Run the suite

uv run pytest --cov=ai_crucible

One-command gate

bash verify.sh