Ship the agent's code review, then ship an agent to doubt it

The first time the-bureau got stuck in an infinite loop re-spawning its own verifier, I learned something useful: oversight that runs on the same reasoning engine as the work it’s overseeing is not oversight. It’s a mirror.

This is the architectural sequel to an earlier post about trust and testing. That one diagnosed the problem — an agent that reports success sounds exactly like one that achieved it, and if your verifier is the same model checking its own output, you’ve just moved the completion signal one hop and called it a solution. This one is about what I built instead.

One pass is not enough

The original verification mechanism in the-bureau was called __verify__. When a task graph completed, the engine would spawn a Haiku agent, hand it a prompt, and ask it to evaluate whether the work was done. This was, in retrospect, the wrong abstraction on every axis.

The Haiku agent had no structured output format, no per-check granularity, no retry semantics, and no separation from the context that produced the work. It was a black box that returned a yes or a no, and if it was wrong about the yes, nothing downstream would catch it. It also couldn’t distinguish between “the tests pass” (checkable with an exit code) and “the code is correct” (requires judgment). Those are different questions that need different tools.

The deeper problem: a model asked to verify its own chain of reasoning will tend toward confirmation. Not maliciously — it’s the same completion-seeking behavior that causes “all tests passing” reports when three tests have been quietly skipped. Ask the same process to both do the work and validate the work, and the validation will be shaped by the same priors that produced the result.

Decompose the work, then gate the completion

The-bureau runs work as task graphs. A graph declares a set of tasks with dependencies; the engine dispatches them as parallel Kubernetes workers once their dependencies are met. Each worker is a Claude CLI session running in its own pod, with its own context, its own tools, and no shared state with other workers except what the orchestrator explicitly passes.

When all tasks in a graph are terminal, the engine does not immediately mark the graph completed. It enters a validating state. Completion is gated. The state machine has 24 legal transitions and running → completed is not one of them — you go running → validating first, and only a passing verdict gets you to validated.

stateDiagram-v2
    running --> validating: validate
    validating --> completed: complete
    validating --> failed: fail

That one extra state is the architectural commitment. “Done” is a claim. validated is evidence.

The criterion engine: doubt without LLMs

Acceptance criteria are typed. There are four kinds:

command — runs a bash command inline; passes on exit 0
script — runs a named plugin from plugins/criteria/; same exit-code semantics
assertion — in-process check: file_exists, file_not_empty, regex:pattern:path, json_valid, exit_zero:cmd
agent — subjective evaluation; delegates to a separate agent

The first three run as plain child processes. No LLM. The exit code is the verdict. They can’t hallucinate pass. The bundled typecheck-workspace plugin runs npm run typecheck --workspace=$WORKSPACE and returns whatever the compiler says.

{
  "project": "the-bureau",
  "tasks": [{ "id": "refactor", "role": "coder", "prompt": "..." }],
  "acceptanceCriteria": [
    {
      "name": "types-clean",
      "type": "script",
      "check": "typecheck-workspace",
      "inputs": { "WORKSPACE": "packages/core" },
      "onFail": "fix",
      "fixRole": "debugger",
      "maxRetries": 1
    },
    {
      "name": "tests-pass",
      "type": "command",
      "check": "npm test --workspace=packages/core",
      "onFail": "retry",
      "maxRetries": 2
    },
    {
      "name": "review",
      "type": "agent",
      "check": "Verify the refactor preserves all existing API contracts",
      "onFail": "fail"
    }
  ]
}

The onFail semantics matter: fail stops immediately; retry re-runs up to maxRetries + 1 attempts; fix dispatches a fixer agent (default role: debugger), waits for it to resolve the issue, then re-evaluates. Each criterion emits its own lifecycle event — criterion_passed, criterion_failed, criterion_skipped — so you can see exactly which check blocked completion and why.

The flow when all tasks finish:

flowchart TD
    A[all tasks done] --> B{acceptanceCriteria?}
    B -- no --> Z[completed]
    B -- yes --> C[status = validating]
    C --> D[split by type]
    D --> E[command / script / assertion]
    D --> F[agent criteria]
    E --> G[CriterionEngine.evaluateAll]
    G --> H{all inline passed?}
    F -- present --> I[child validation graph]
    F -- none, H yes --> J[validated]
    F -- none, H no --> K[validation_failed]
    I --> L[code-reviewer eval tasks]

Agent criteria: a separate judge

When the criteria include agent types, the engine declares a child validation graph — completely new workers, new pods, with no memory of the work they’re judging except what’s explicitly handed to them.

The default verifier role is code-reviewer. Here’s the relevant part of its agent profile, stripped of bureau-specific tooling:

You are a senior code reviewer. Your anchor: **project conventions beat
textbook conventions.** If the codebase uses a pattern you disagree with,
respect it unless it introduces a correctness or security issue.

## Pre-task investigation protocol (mandatory, no exceptions)

1. **Read project conventions.** Check CLAUDE.md, README.md, CONTRIBUTING.md,
   and linter/formatter configs. These are your source of truth for style.
2. **Read sibling files.** For every changed file, read at least two other
   files in the same directory. This reveals local naming, structural
   conventions, and abstraction style.
3. **Trace imports one level deep.** Verify the change respects existing
   interface contracts — argument types, return shapes, error protocols.
4. **Identify change intent.** Run `git log` on the commits. Understand what
   the author was trying to accomplish before judging how they did it.

## Before writing any finding

1. Is this a real bug, or a style preference? If style — drop it.
2. **What is my confidence? High (>80%) = finding. Medium (50–80%) = question.
   Low (<50%) = drop it entirely.**
3. Would I flag this in my own code? If you'd give yourself a pass, give the
   author a pass.
4. Am I suggesting a concrete fix? "This is confusing" is not a finding.
5. Does this help the author, or only demonstrate my knowledge? If the latter,
   drop it.

## Hard limits

- No personal style preferences over project conventions.
- No rewriting code for the author — suggest the fix, let them implement it.
- No nitpicking formatting a linter already handles.
- No blocking a merge over subjective disagreements. Blockers are for
  objective correctness issues only.

The code-reviewer hasn’t seen the implementation context, the intermediate attempts, the scaffolding that was built and torn down. It sees the diff and the criteria and nothing else. That isolation is the point. An agent instructed to doubt, reviewing work it played no part in producing, will find things the original worker missed — because it doesn’t share the blind spots.

The pattern generalises

The same write → verify → rework loop shows up everywhere in the stack, not just code.

The-bureau’s self-improvement loop fires after qualifying graphs complete and analyzes the session transcript for gaps in the bureau’s own codebase, routing findings to git — new issues, documentation gaps, quality findings targeting the bureau’s own code. It’s not a quality gate; it doesn’t block anything. It’s the meta-pass: running after the pass, asking “what did we miss?”

Documentation follows the same adversarial shape. A document-project run dispatches a documentarian agent to write a subsystem note and produce a claim manifest — a structured list of every factual assertion the note makes. Then a separate claims-verifier agent (which never sees the documentarian’s reasoning, only the manifest and the corpus) checks each claim against the actual source: code lines, commits, issues, test names. Unverified claims come back as rework. Claims that survive three rounds of rework get marked status: disputed. The docs in the bureau’s corpus that are marked status: verified have been through this — every sentence in them has a citation to a source that exists.

The verifier never sees the writer’s reasoning — only the manifest and the corpus. Keep it that way; it is the point.

That rule applies equally to code, documentation, and any other artifact you want honest oversight on. The moment the verifier shares context with the work, it shares the blind spots too.

What this costs

It’s not cheap. Every validated graph runs: the initial workers, the inline criterion checks, a code-reviewer child graph for agent criteria, and a retro analysis afterward. That’s multiple Claude sessions per graph completion. Token costs are real and they’re not going down.

The honest answer is that the confidence is worth it. We’ve been shipping more features and fighting fewer fires precisely because the verification is honest — it comes from a process that cannot see what the original worker believed was true.

You can run a single-agent review pass and get something. It’ll look like verification. It’ll produce structured output and a confidence score and a summary. And then one of the findings it was most confident about will turn out to be wrong, because the model that found the issue is the same model that missed why it isn’t one.

The expensive version is the one that catches that.

The guardrails have guardrails. That’s not paranoia. That’s what it takes to ship things you trust.