ai-crucible a measurement instrument that happens to be fun.
One Claude session crafts puzzles targeting real, currently-observed capability gaps; another attempts them. A policy-enforced kernel scores against a hidden oracle and curates a Lab → Arena → Regression catalog — rewarding elegance and novelty, penalizing answer-bypass.
Clone
git clone https://github.com/dogfood-lab/ai-crucible
Install
uv sync --extra dev --extra stats
Verify
bash verify.sh
What makes it different
A diagnostic instrument, not a leaderboard.
Capability, not "cheating"
Distinguishes elegance and novelty (rewarded) from answer-bypass (penalized). Lateral thinking is a capability to measure, not a vice to punish.
The instrument measures itself
Prompt framing is a first-class measured arm — the kernel runs the same puzzle under neutral / self-referential / social framings and reports its own prompt-effect.
A sealed boundary
Motivation and measurement never share a context window; the hidden oracle is graded out-of-band by a different model family with the agent’s reasoning hidden.
Reliability by consistency
pass^k (all k trials succeed), Wilson intervals, and cross-family judge panels — distributions, not point estimates.
Quick start
Set up
uv sync --extra dev --extra stats Run the suite
uv run pytest --cov=ai_crucible One-command gate
bash verify.sh