dxkit
A deterministic stop condition and code-graph context layer for AI coding agents.
Autonomous coding loops face two control problems: orienting in the code while they make a change, and deciding whether that change made the repository worse before they stop.
dxkit addresses both. While the agent works, it provides a code graph of callers, callees, blast radius, and the files a change touches. Then, when the agent tries to stop, dxkit baselines existing findings, reruns trusted checks, and blocks only net-new detector-backed regressions with a concrete repair reason.
In our loop benchmark, vanilla Claude Code-style loops stopped with net-new debt in 11 of 16 runs. A prompt that told the agent to self-check still escaped 9 of 16. With dxkit's Stop-gate, we observed 0 of 16 escapes: when the loop tried to stop dirty, dxkit blocked, handed back the exact net-new finding, and the agent repaired before stopping clean.
Recorded from a real run on a synthetic repo, shortened for readability. Blocked and repaired inside the same warm loop.
dxkit does not reinvent detection. It runs trusted open source scanners (gitleaks, Semgrep, OSV, npm audit, and more), and it can ingest results from Snyk and CodeQL. What dxkit adds is the agent-loop layer around those tools: a per-stop, baseline-relative verdict of whether this change introduced a new finding, returned to the agent with the exact repair reason while the loop is still warm.
npm init @vyuhlabs/dxkit -- --claude-loop --yes # install dxkit + register the Claude Code Stop hook
npx vyuh-dxkit baseline create # grandfather today's findings
npx vyuh-dxkit loop doctor # verify the gate is wiredThe stop verdict has no model in the path: same input, same verdict. Existing debt stays grandfathered; only net-new regressions block.
Read the benchmark · Try it on your repo · Run the fixture gate
The problem: loops do not know when they made things worse
An autonomous loop runs until the agent decides it is done. The common checks in that loop (tests, linters, scanners, CI-style commands) usually answer whether something is broken or flagged. They do not, by themselves, maintain a brownfield baseline and answer the loop-level question: did this change introduce something net-new? So an agent can add a feature, leave a new untested path or a hardcoded credential behind, run the tests, see green, and declare success.
In our benchmark this happened in most vanilla runs, and telling the agent to check its own work only helped a little.
What dxkit does
- Build a structural code graph. dxkit gives the agent callers, callees, blast radius, and relevant files so it can orient before editing.
- Baseline today's debt.
baseline createrecords current findings, so pre-existing issues are grandfathered and never block. - Run a deterministic Stop-gate on every stop. A Claude Code Stop hook reruns the guardrail against that baseline. Same input gives the same verdict; no model decides whether the gate passes.
- Feed net-new findings back to the agent. If the change introduced a finding, the gate blocks the stop and hands the agent the exact finding to fix: do not refresh the baseline, do not touch unrelated debt, fix what this branch introduced. The loop stops only when clean.
Why only net-new findings?
Grandfathered does not mean accepted.
dxkit blocks only net-new findings for two reasons: agent-loop attribution and brownfield adoption.
First, an autonomous coding loop needs a scoped stop condition. When an agent tries to declare done, the relevant question is not "is this entire repository debt-free?" It is:
did this loop make the repository worse than the baseline?
If the gate asks the agent to fix every pre-existing finding before it may stop, the repair target becomes noisy and unbounded. The agent may churn unrelated code, spend context on old debt, or refresh the baseline to escape. dxkit instead holds the loop accountable for the change it just made: fix what this branch introduced, do not touch unrelated debt, and do not move the baseline.
Second, dxkit is designed for brownfield repositories. Existing debt may include hundreds or thousands of findings. If the first gate required a repo to reach zero findings, most teams could not adopt agentic development workflows until after a large cleanup project. That is backwards. The first control invariant is simpler and stricter:
this agent must not make the repository worse than the baseline.
baseline create records the current state so existing findings remain visible
and auditable, but they do not block the current loop. When an agent changes the
repo, dxkit blocks only findings introduced by that change. This lets teams adopt
agentic workflows immediately, prevent regression from day one, and pay down the
old baseline as a separate, deliberate workstream.
A baseline refresh is a governance action, not a repair action. If the Stop-gate blocks, the agent should fix the net-new finding it introduced and should not move the baseline.
Who this is for
Use dxkit if you let coding agents:
- run unattended or semi-attended,
- fix CI or review comments in loops,
- touch brownfield repos that already carry debt,
- or work where "new debt" matters more than "all debt."
What dxkit is, and is not
It is a deterministic verification layer. It baselines today's findings, fingerprints them across churn, and blocks only net-new regressions.
It is not a scanner replacement. It runs and ingests scanners (gitleaks, Semgrep, CodeQL, Snyk, SARIF) and makes their findings enforceable. It does not claim to find more bugs than they do.
It is not an LLM judge. No model decides whether the gate passes. The model can repair findings. The gate itself is deterministic, and the prompt does not grow as the baseline grows.
It is not a guarantee of safe code. It blocks detector-backed net-new findings it can observe. You still need tests, review, scanners, and judgment.
Built on tools you already trust
dxkit is an orchestration and enforcement layer, not another scanner. It runs established open source tools and treats their output as one stream. Which tools run depends on the languages in your repo. dxkit covers 8 ecosystems (TypeScript / JavaScript, Python, Go, Rust, C# / .NET, Java, Kotlin, Ruby).
Universal, on every repo:
- secrets: gitleaks
- code patterns: Semgrep
- dependency advisories: OSV.dev
- size, duplication, and the code graph: cloc, jscpd, graphify
Per language, dxkit adds that ecosystem's own linter and audit tool. For
example, npm audit + ESLint (JS / TS), pip-audit + ruff (Python), govulncheck +
golangci-lint (Go), cargo-audit + clippy (Rust), dotnet list --vulnerable
(C#), osv-scanner + PMD (Java), osv-scanner + detekt (Kotlin), and
bundler-audit + RuboCop (Ruby). The full per-language matrix is in Per-pack
capabilities below.
For deep interprocedural analysis, it ingests findings from Snyk Code and CodeQL (or any SARIF file), fingerprints them the same way as native findings, and runs them through the same baseline and gate. You keep the detectors you already have. dxkit makes their findings enforceable inside CI and inside the agent loop.
| Layer | Examples | Job |
|---|---|---|
| Detection | gitleaks, Semgrep, OSV, npm audit, Snyk, CodeQL, SARIF | Find issues |
| dxkit | baseline, fingerprint matcher, Stop-gate, loop ledger | Decide whether this change introduced something net-new |
| Agent | Claude Code or another coding loop | Repair the exact finding and try to stop again |
Try it on your repo
The Stop hook runs dxkit on every stop, so install dxkit into the repo. This
one command adds it as a devDependency and registers the hook additively, so your
existing .claude settings are preserved:
npm init @vyuhlabs/dxkit -- --claude-loop --yes
npx vyuh-dxkit baseline create # grandfather today's findings
npx vyuh-dxkit loop doctor # verify the gate is wired safely and dxkit resolves
# then run Claude Code as you normally would. The Stop-gate fires on every stop.
npx vyuh-dxkit loop ledger summarize # afterwards: blocked vs allowed, repaired-after-blockWhen the agent tries to stop, dxkit runs the net-new gate against the baseline. Existing findings are grandfathered; only findings this change introduced block.
Run a local fixture gate
Want to see the Stop-gate before installing dxkit into your repo?
npx -y @vyuhlabs/dxkit@latest demo loop-guardrailThis runs the real gate on a temporary fixture repo: baseline → introduce a
net-new secret → BLOCK → repair → CLEAN, then it tears the fixture down. No API
key and no Claude Code, and your own repo is never touched. It needs gitleaks
installed and takes about 20 seconds; without gitleaks it shows a clearly
labelled illustration instead. (It does a one-time npx download, so it is not
fully offline, though the gate itself is.)
Presets: what blocks the loop
security-only (default) secrets and critical or high vulnerabilities. Bounded, must-fix, cheap to gate.
full-debt (opt-in) also gates test gaps and maintainability regressions. Repairs can be expensive.
The default is security-only. The headline escape-rate benchmark used
full-debt (it gated both the secret trap and the test-gap trap); the default
install starts narrower so a first run does not trap users in expensive
test-generation loops. Switch with
npm init @vyuhlabs/dxkit -- --claude-loop --loop-preset full-debt.
Give the agent a map, not just a gate
The Stop-gate controls what a loop is allowed to ship. The code graph controls how the agent does the work in between. When dxkit scaffolds a repo it builds a code graph and installs skills that drive real development off it, so the agent orients by querying structure instead of grepping and re-reading whole files.
- Build a feature (
dxkit-featureskill): query the graph for where the feature plugs in, what patterns already exist, and what the change will touch, then implement against those patterns and run the analyzers on the result before it stops. - Fix a finding (
dxkit-actionskill): take a flagged finding, pull its callers, callees, and blast radius from the graph, repair it, and confirm the change did not introduce something net-new.
The agent gets callers, callees, and blast radius up front as a budget-bounded slice, not a pile of file reads. It is the same graph, the same baseline, and the same identity contract the gate already uses.
What the benchmarks actually show is predictable spend, not guaranteed cheaper spend. On a large repo the median was roughly tied, the worst-case session used about 57% fewer tokens, and the variance was roughly halved. On a small repo the overhead was about zero. The graph caps the expensive tail. It does not promise a lower average, and it does not make the agent write better code on its own.
This is a different axis from detection. Snyk, SonarQube, and CodeQL tell you what is wrong. They do not give the agent a map of the code or bound how much it spends finding its way around. dxkit does both: the gate bounds what the loop ships, the graph bounds how the loop works.
The numbers
Three independent benchmark results, one theme: dxkit makes agent work more predictable.
| Layer | What it bounds | Observed result |
|---|---|---|
| Stop-gate | net-new detector-backed debt | vanilla loops escaped 11/16 times, prompt-only checklist escaped 9/16, dxkit escaped 0/16 |
| Deterministic identity | false "net-new" findings under churn | caught all 3 seeded regressions with 0/2 false blocks on clean edits; 0 false net-new on tested line shifts and renames |
| Graph context | large-repo exploration tails | median roughly tied, but large-repo mean tokens 30% lower, worst case 57% lower, variance roughly halved |
Deferral has a re-orientation cost. A fourth arm of the loop-safety study measured the "detect on CI, fix later" model: on the test-gap task, deferring a net-new finding to a cold session cost ~49% more in equivalent cost and ~51% more turns than repairing it inside the warm loop, because the cold fixer has to re-orient in a context it no longer holds. (The secret-task premium pointed the same way but was weak (mean +19%, median slightly negative), so we lean on the robust test-gap result.) So the gate is not just safer than deferring, it is plausibly cheaper too.
And the gate is fast enough to run on every stop. dxkit 2.14.0 scopes the Stop-gate scan to the active preset's blockable finding kinds and re-scans only the changed files, reusing cached results for everything unchanged. The verdict is identical to a full scan; the cost is seconds per stop, not minutes, even on large repositories.
Benchmark caveats: the loop-safety study uses controlled synthetic tasks plus real-repo validation, detector-backed findings, and Sonnet runs. It is not a CVE corpus, not a claim of better detection, and not a guarantee that dxkit catches every possible bug. The claim is narrower: for findings the detector observes, dxkit gives the loop a deterministic net-new stop decision.
Full methodology, reproducibility notes, artifact status, and caveats are in docs/benchmarks.md.
Why not just Snyk, SonarQube, or CodeQL?
Use them. dxkit can ingest their findings. The difference is tempo and control, not detection. Cloud scanners are strong detection engines, and they usually run on a CI or PR cadence. A coding-agent loop needs a local stop decision every time the agent tries to declare done.
| Loop Stop-gate need | dxkit | Cloud or CI scanners |
|---|---|---|
| Runs locally on every stop, in seconds | yes | usually CI or cloud cadence |
| Deterministic verdict, no model in the gate | yes | varies (some add an LLM judge) |
| Grandfathers existing debt | yes | tool-dependent |
| Feeds the exact block reason back to the warm agent session | yes | usually a human-facing dashboard or PR |
The goal is not to replace scanners. It is to make their findings enforceable at the speed of the agent loop.
Beyond loops
The same deterministic core powers the rest of dxkit: pre-push and CI guardrails, brownfield baselines, durable finding identity, SARIF, CodeQL, and Snyk ingest, a six-dimension health report, code-graph context, and a set of Claude Code skills. See the docs.
Languages
dxkit covers 8 ecosystems. Detection is automatic from your manifests and source; each language brings its own native linter, dependency-audit tool, and coverage parser, layered on the universal scanners (gitleaks, Semgrep, OSV, cloc, jscpd, graphify).
| Language | Detected by | Native linter + audit |
|---|---|---|
| TypeScript / JavaScript | package.json |
ESLint, npm audit |
| Python | pyproject.toml, *.py |
ruff, pip-audit |
| Go | go.mod |
golangci-lint, govulncheck |
| Rust | Cargo.toml |
clippy, cargo-audit |
| C# / .NET | *.csproj, *.sln |
dotnet-format, dotnet list --vulnerable |
| Java | pom.xml, src/main/java/ |
PMD, osv-scanner |
| Kotlin | *.gradle{.kts,}, *.kt |
detekt, osv-scanner |
| Ruby | Gemfile, *.rb |
RuboCop, bundler-audit |
Per-pack capabilities: coverage import, import-graph, severity tiers (click to expand)
| Language | Detection | Coverage import | Import-graph | Native tools | Lint severity tiers | Vuln severity tiers |
|---|---|---|---|---|---|---|
| TS / JS | package.json |
Istanbul | import/require/re-export | eslint, npm audit, vitest-coverage | ESLint rule ID | npm audit native |
| Python | pyproject.toml, setup.py, *.py |
coverage.py | import/from | ruff, pip-audit, coverage | ruff code prefix | pip-audit + OSV.dev (CVSS v3+v4) |
| Go | go.mod |
coverprofile | import blocks | golangci-lint, govulncheck | FromLinter family |
govulncheck embedded + OSV.dev |
| Rust | Cargo.toml |
lcov + cobertura | use statements, extracted only¹ | clippy, cargo-audit, cargo-llvm-cov | clippy group | cargo-audit native |
| C# | *.csproj, *.sln |
cobertura XML | using declarations, extracted only¹ | dotnet-format (formatter) | format-only² | dotnet list --vulnerable |
| Kotlin | gradle/*.gradle{.kts,}, *.kt |
JaCoCo XML | import statements, extracted only¹ | detekt, osv-scanner (Maven) | detekt severity | osv-scanner + OSV.dev (Maven) |
| Java | pom.xml, src/main/java/, *.java |
JaCoCo XML | import statements, extracted only¹ | PMD, osv-scanner (Maven) | PMD priority tiers | osv-scanner + OSV.dev (Maven) |
| Ruby | *.rb |
SimpleCov JSON | require/require_relative, extracted only¹ | rubocop, bundler-audit, osv-scanner | rubocop severity | bundler-audit + osv-scanner (Gemfile.lock) |
¹ Rust, C#, Kotlin, Java, and Ruby populate imports.extracted but the
file-level resolver is a no-op. Downstream analyses that need an edge graph
(reachability, import-graph test-gap credit) degrade to conservative
defaults for those packs. Resolvers are tracked on the roadmap.
² C# uses dotnet-format for formatting violations only. A real
severity-tiered C# linter (Roslyn analyzers or StyleCop) is on the
roadmap. Today every C# formatting violation is counted at low tier
so it does not inflate the Code Quality score.
Reproduce the deterministic tier
The deterministic results (the net-new gate decision and the finding-identity
matcher) reproduce offline with no API key, so you do not have to trust our
numbers. These harnesses live in benchmarks/:
node benchmarks/bench-guardrail.mjs config.json # block/allow on seeded findings
node benchmarks/bench-netnew-isolation.mjs config.json # net-new isolation under churn
node benchmarks/bench-matcher.mjs config.json # false net-new on line shifts + renamesSee benchmarks/README.md to point them at a repo. The agent-driven harnesses
(loop safety, cost of deferral, gate-vs-LLM, and the graph-context sessions) need
a model subscription or API key and are published under benchmarks/agentic/.
Full methodology, the per-study reports, caveats, and repro steps:
docs/benchmarks.md.
Credits
dxkit stands on excellent open source tools. It orchestrates them, it does not replace them. Thank you to the maintainers of graphify (the code graph), gitleaks, Semgrep, OSV-Scanner, jscpd, and cloc. Each tool is installed separately and keeps its own license.
Contributing and roadmap
- Contributing guide: CONTRIBUTING.md
- Roadmap: docs/roadmap.md
- License: MIT