1.8.0 • Published yesterdayCLI

dod-guard

Licence

MIT

Version

1.8.0

Deps

Size

848 kB

Vulns

Weekly

Summary Dependency Versions

dod-guard

Anti-cheat Definition of Done verification for Claude Code. Locks proof commands in MCP storage so editing the rendered markdown cannot weaken verification.

What it does

Locks proofs canonically — proof commands stored in MCP, not in editable markdown
Tamper-blocking — SHA256 fingerprint mismatch forces the verdict to FAIL, not just a warning
Baseline enforcement — dod_create rejects a DoD missing the mandatory proof categories (two-layer integration, full test suite)
Incremental checking — dod_check --step N verifies one step fast; only a full run can return PASS
Amendment audit trail — all proof modifications logged with mandatory reasons
Weakening prevention — cannot convert machine-checkable proofs to manual
Structured interviews — /interview skill gathers requirements before implementation

Install

As a Claude Code plugin (recommended)

claude plugin install --from github tychohenzen/dod-guard

As a standalone MCP server

Add to your .mcp.json:

{
  "mcpServers": {
    "dod-guard": {
      "command": "npx",
      "args": ["-y", "dod-guard"],
      "type": "stdio"
    }
  }
}

Via npm global install

npm install -g dod-guard

MCP Tools

Tool	Description
`dod_create`	Create a locked DoD (declares a `type`; each proof a `category`). Rejects DoDs missing mandatory baseline categories
`dod_check`	Execute proofs from canonical storage, return PASS/FAIL/INCOMPLETE. Optional `step` N verifies one step (scoped → INCOMPLETE, never PASS). Never auto-prompts manual/review proofs — see `dod_verify`
`dod_verify`	Request human out-of-band verification (popup, notes field) for ONE manual/review proof. Call it when verification is actually relevant, not on every `dod_check`
`dod_status`	Read cached last check result without re-running
`dod_amend`	Modify a proof with mandatory reason (audit-logged)
`dod_list`	List all tracked DoDs with status
`dod_import`	Parse existing DoD markdown and lock its proofs

Skills

`/interview`

Structured requirements gathering skill. Researches the codebase, asks targeted questions one at a time, builds a confirmed requirements summary, then creates a locked DoD via dod_create.

The output is a self-contained spec with testable proofs that can be passed to /goal for autonomous implementation.

How it works

Proof lifecycle

/interview → dod_create → [implement] → dod_check → PASS/FAIL
                                              ↓
                                    dod_amend (if unreasonable)

Create — /interview or direct dod_create call locks proofs in ~/.claude/dod-store/
Implement — work through steps, proofs are the acceptance criteria
Check — dod_check executes commands from the locked store (not the markdown)
Amend — if a proof is genuinely unreasonable, dod_amend modifies it with a logged reason

Anti-cheat properties

Proof commands live in ~/.claude/dod-store/{uuid}.json claude is not aware of, not in the markdown file claude may read/alter
dod_check reads from the store — editing markdown proof text has zero effect
Each check prints a SHA256 fingerprint — compare to detect store tampering
Cannot weaken a machine-checkable proof to manual (blocked server-side)
manual proofs are confirmed by the human out-of-band (elicitation / server dialog) — Claude cannot self-confirm or fabricate the answer
All amendments are permanently logged with timestamps and reasons

Predicate types

Type	Value	Passes when
`exit_code`	`0`	Command exits 0
`exit_code`	`1`	Command exits 1 (e.g. grep no matches)
`exit_code_not`	`0`	Command exits non-zero
`output_contains`	`"text"`	stdout contains text
`output_matches`	`"regex"`	stdout matches regex
`output_not_contains`	`"text"`	stdout does NOT contain text
`output_not_matches`	`"regex"`	stdout does NOT match regex
`tdd`	`0`	TDD enforcer. Must be observed failing before it can pass
`manual`	—	Human-verified. Confirmed via `dod_verify`, called explicitly by Claude, through a channel Claude cannot drive — see Manual verification
`review`	—	Fresh-context code review. The agent runs `/code-review` against the diff vs requirements, then calls `dod_verify`; the PASS/FAIL verdict arrives through the same out-of-band channel as `manual` (model cannot self-pass). For intent/edge-case correctness commands can't assert
`mutation`	`N` (default `0`)	Mutation testing. Runs the command in-band, parses surviving (un-killed) mutants from Stryker / mutmut / cargo-mutants output, and passes iff survivors `<= N`. Output it cannot parse FAILs (fail-safe — never auto-passes). The strongest signal that tests actually catch bugs; scope to changed/critical functions
`regression`	`tol` (fraction, e.g. `0.10`)	Non-regression gate. Two-phase: a capture run on pre-change code stores the metric baseline N0; later runs compare N1 against N0 with tolerance `tol`. `extract` (regex, group 1) or the last number in stdout picks the metric; unparseable output FAILs. `lower_is_better` (default true) for perf/complexity/duplication, false for coverage. Defaults to advisory — set `advisory: false` for a hard SLA gate. Proves quality doesn't regress vs a baseline, never an impossible absolute target

Advisory tier. Any proof may set advisory: true: a failing advisory proof warns loudly but does not fail its step or the overall verdict. regression proofs default to advisory. The flag is part of the proof fingerprint, so a hard gate cannot be silently downgraded.

TDD enforcement

The tdd predicate enforces test-driven development by requiring proof of a red-green cycle:

Write a failing test
Run dod_check — records the failure (RED phase, seen_failing=true)
Implement the feature
Run dod_check again — test passes AND was previously seen failing → proof passes

If a test passes without ever being observed failing, dod-guard rejects it with "TDD VIOLATION". This prevents writing tests after implementation that merely confirm existing behavior.

Manual verification

Some acceptance criteria can't be machine-checked (e.g. "the app launches and the dashboard renders correctly"). The manual predicate covers these — but in a way Claude cannot fake.

dod_check never auto-prompts. Every earlier version of dod-guard fired a popup for every unverified manual/review proof on every single dod_check run — including proofs on steps nobody was working on yet. Now dod_check only reads whatever verdict is already on record: an unverified proof reports skipped and holds the overall verdict at INCOMPLETE (not FAIL — "not yet checked" is distinct from "checked and failed").

dod_verify triggers the actual prompt. Claude calls dod_verify(dod_id, proof_id) explicitly, when it judges verification is actually relevant — typically right after implementing the step that proof belongs to, not preemptively for steps still ahead. This is the only path that fires the human-facing channel:

A distinctive audible jingle (Windows) plays to draw the user's attention.
The server asks the human directly through a channel Claude does not control:
- Popup (primary) — a server-spawned Windows dialog with PASS / FAIL buttons and a free-text notes field. No timeout — the human may take a while to respond, so it waits indefinitely rather than auto-failing on a clock.
- MCP elicitation (fallback) — used only where the popup can't run (non-Windows hosts) and the client advertises elicitation support.
The human's verdict (and any note) is recorded on the proof (manual_result) with a timestamp, channel, and a fingerprint of the proof text. Run dod_check afterward to fold it into the overall verdict.

Anti-cheat guarantee. Neither dod_check nor dod_verify accepts a parameter that could carry a "passed" verdict. The answer is sourced solely from the popup or elicitation — both outside the model's reach. Claude can choose when to call dod_verify, but cannot supply, infer, or fabricate the confirmation itself. If no human is available (non-interactive run, or no channel on this host), the proof fails — a missing human can never produce a pass. And since dod_check no longer auto-fires the channel, Claude cannot dodge the anti-cheat by simply never calling dod_verify either — an unrequested manual/review proof holds the whole DoD at INCOMPLETE, never PASS.

Persistence. Whatever dod_verify last recorded (PASS or FAIL) is what dod_check reports — it never re-prompts on its own. The record is keyed to a fingerprint of the proof's command, predicate, and description; if the proof changes (e.g. via dod_amend), the fingerprint no longer matches and dod_check reports it unverified (skipped) again until dod_verify is called afresh. To retry a FAIL, just call dod_verify again — nothing about a prior FAIL blocks a new attempt.

OS-correct commands (no bash-on-Windows)

Proof commands execute on the host OS via dod_check. To stop the common failure of authoring Linux/bash commands that then fail on a Windows host (and the slow dod_amend cleanup that follows), dod-guard validates commands up-front:

On dod_create, dod_import, and dod_amend, every non-manual proof command is parsed for the executables it invokes (across pipes and &&/||/; chains, respecting quotes).
Each executable is checked against the current OS: cmd.exe built-ins, anything on PATH (where / command -v), or a real file at the working directory.
If any tool is missing, the operation is rejected with the offending tools and a suggested native replacement (grep→findstr, cat→type, ls→dir, rm→del, …). The DoD is not created/amended until the commands are OS-correct.

This forces correct commands at authoring time instead of discovering breakage at check time. Checks that genuinely need a human use a manual proof (see above) rather than a shell command.

Tamper detection (blocking)

Each DoD stores a SHA256 fingerprint of its proof set at creation time. On every dod_check, the current fingerprint is compared to the stored original. If they don't match — the store was edited outside dod_amend — the verdict is forced to FAIL (not merely warned). dod_amend legitimately re-locks the fingerprint, so real changes go through the audited path; a raw store edit can never return PASS.

Baseline category enforcement

Every proof declares a category and each DoD declares a type (bug / general), keyed to the company baselines in standards/. dod_create rejects a DoD missing the mandatory machine-checkable categories — two-layer integration (integration_wiring + integration_behavioral) and the full test suite — and warns when TDD is absent or a step is proven only by presence/structural checks. The mandate is enforced by the tool, not left to the authoring agent's goodwill.

Incremental checking

dod_check accepts an optional step (1-based). A scoped run executes only that step's proofs and carries the others forward from their last result without re-running them — fast iteration without paying for the whole suite each time. A scoped run always returns INCOMPLETE, never PASS, so it can't satisfy a /goal completion gate; run dod_check with no step for the full verdict. Scoped runs never overwrite the canonical last full verdict.

Development

npm install
npm run build    # TypeScript compilation
npm run bundle   # esbuild → dist/bundle.js
npm start        # Run MCP server

License

MIT