2.18.1 • Published 5d agoCLI

@vyuhlabs/dxkit

Licence

MIT

Version

2.18.1

Deps

Size

5.4 MB

Vulns

Weekly

654

Summary Dependency Versions

dxkit

A deterministic stop condition and code-graph context layer for AI coding agents.

Autonomous coding loops face two control problems: orienting in the code while they make a change, and deciding whether that change made the repository worse before they stop.

dxkit addresses both. While the agent works, it provides a code graph of callers, callees, blast radius, and the files a change touches. Then, when the agent tries to stop, dxkit baselines existing findings, reruns trusted checks, and blocks only net-new detector-backed regressions with a concrete repair reason.

In our loop benchmark, vanilla Claude Code-style loops stopped with net-new debt in 11 of 16 runs. A prompt that told the agent to self-check still escaped 9 of 16. With dxkit's Stop-gate, we observed 0 of 16 escapes: when the loop tried to stop dirty, dxkit blocked, handed back the exact net-new finding, and the agent repaired before stopping clean.

dxkit's Stop-gate blocks a coding-agent loop on a net-new critical dependency vulnerability, the agent bumps the version, and the gate goes clean.

_{Recorded from a real run on a synthetic repo, shortened for readability. Blocked and repaired inside the same warm loop.}

dxkit does not reinvent detection. It runs trusted open source scanners (gitleaks, Semgrep, OSV, npm audit, and more), and it can ingest results from Snyk and CodeQL. What dxkit adds is the agent-loop layer around those tools: a per-stop, baseline-relative verdict of whether this change introduced a new finding, returned to the agent with the exact repair reason while the loop is still warm.

npm init @vyuhlabs/dxkit -- --claude-loop --yes   # install dxkit + register the Claude Code Stop hook
npx vyuh-dxkit baseline create                    # grandfather today's findings
npx vyuh-dxkit loop doctor                         # verify the gate is wired

The stop verdict has no model in the path: same input, same verdict. Existing debt stays grandfathered; only net-new regressions block.

Read the benchmark · Try it on your repo · Run the fixture gate

The problem: loops do not know when they made things worse

An autonomous loop runs until the agent decides it is done. The common checks in that loop (tests, linters, scanners, CI-style commands) usually answer whether something is broken or flagged. They do not, by themselves, maintain a brownfield baseline and answer the loop-level question: did this change introduce something net-new? So an agent can add a feature, leave a new untested path or a hardcoded credential behind, run the tests, see green, and declare success.

In our benchmark this happened in most vanilla runs, and telling the agent to check its own work only helped a little.

What dxkit does

Build a structural code graph. dxkit gives the agent callers, callees, blast radius, and relevant files so it can orient before editing.
Baseline today's debt. baseline create records current findings, so pre-existing issues are grandfathered and never block.
Run a deterministic Stop-gate on every stop. A Claude Code Stop hook reruns the guardrail against that baseline. Same input gives the same verdict; no model decides whether the gate passes.
Feed net-new findings back to the agent. If the change introduced a finding, the gate blocks the stop and hands the agent the exact finding to fix: do not refresh the baseline, do not touch unrelated debt, fix what this branch introduced. The loop stops only when clean.

Why only net-new findings?

Grandfathered does not mean accepted.

dxkit blocks only net-new findings for two reasons: agent-loop attribution and brownfield adoption.

First, an autonomous coding loop needs a scoped stop condition. When an agent tries to declare done, the relevant question is not "is this entire repository debt-free?" It is:

did this loop make the repository worse than the baseline?

If the gate asks the agent to fix every pre-existing finding before it may stop, the repair target becomes noisy and unbounded. The agent may churn unrelated code, spend context on old debt, or refresh the baseline to escape. dxkit instead holds the loop accountable for the change it just made: fix what this branch introduced, do not touch unrelated debt, and do not move the baseline.

Second, dxkit is designed for brownfield repositories. Existing debt may include hundreds or thousands of findings. If the first gate required a repo to reach zero findings, most teams could not adopt agentic development workflows until after a large cleanup project. That is backwards. The first control invariant is simpler and stricter:

this agent must not make the repository worse than the baseline.

baseline create records the current state so existing findings remain visible and auditable, but they do not block the current loop. When an agent changes the repo, dxkit blocks only findings introduced by that change. This lets teams adopt agentic workflows immediately, prevent regression from day one, and pay down the old baseline as a separate, deliberate workstream.

A baseline refresh is a governance action, not a repair action. If the Stop-gate blocks, the agent should fix the net-new finding it introduced and should not move the baseline.

Who this is for

Use dxkit if you let coding agents:

run unattended or semi-attended,
fix CI or review comments in loops,
touch brownfield repos that already carry debt,
or work where "new debt" matters more than "all debt."

What dxkit is, and is not

It is a deterministic verification layer. It baselines today's findings, fingerprints them across churn, and blocks only net-new regressions.

It is not a scanner replacement. It runs and ingests scanners (gitleaks, Semgrep, CodeQL, Snyk, SARIF) and makes their findings enforceable. It does not claim to find more bugs than they do.

It is not an LLM judge. No model decides whether the gate passes. The model can repair findings. The gate itself is deterministic, and the prompt does not grow as the baseline grows.

It is not a guarantee of safe code. It blocks detector-backed net-new findings it can observe. You still need tests, review, scanners, and judgment.

Built on tools you already trust

dxkit is an orchestration and enforcement layer, not another scanner. It runs established open source tools and treats their output as one stream. Which tools run depends on the languages in your repo. dxkit covers 8 ecosystems (TypeScript / JavaScript, Python, Go, Rust, C# / .NET, Java, Kotlin, Ruby).

Universal, on every repo:

secrets: gitleaks
code patterns: Semgrep
dependency advisories: OSV.dev
size, duplication, and the code graph: cloc, jscpd, graphify

Per language, dxkit adds that ecosystem's own linter and audit tool. For example, npm audit + ESLint (JS / TS), pip-audit + ruff (Python), govulncheck + golangci-lint (Go), cargo-audit + clippy (Rust), dotnet list --vulnerable (C#), osv-scanner + PMD (Java), osv-scanner + detekt (Kotlin), and bundler-audit + RuboCop (Ruby). The full per-language matrix is in Per-pack capabilities below.

For deep interprocedural analysis, it ingests findings from Snyk Code and CodeQL (or any SARIF file), fingerprints them the same way as native findings, and runs them through the same baseline and gate. You keep the detectors you already have. dxkit makes their findings enforceable inside CI and inside the agent loop.

Layer	Examples	Job
Detection	gitleaks, Semgrep, OSV, npm audit, Snyk, CodeQL, SARIF	Find issues
dxkit	baseline, fingerprint matcher, Stop-gate, loop ledger	Decide whether this change introduced something net-new
Agent	Claude Code or another coding loop	Repair the exact finding and try to stop again

Try it on your repo

The Stop hook runs dxkit on every stop, so install dxkit into the repo. This one command adds it as a devDependency and registers the hook additively, so your existing .claude settings are preserved:

npm init @vyuhlabs/dxkit -- --claude-loop --yes
npx vyuh-dxkit baseline create      # grandfather today's findings
npx vyuh-dxkit loop doctor          # verify the gate is wired safely and dxkit resolves
# then run Claude Code as you normally would. The Stop-gate fires on every stop.
npx vyuh-dxkit loop ledger summarize  # afterwards: blocked vs allowed, repaired-after-block

When the agent tries to stop, dxkit runs the net-new gate against the baseline. Existing findings are grandfathered; only findings this change introduced block.

Run a local fixture gate

Want to see the Stop-gate before installing dxkit into your repo?

npx -y @vyuhlabs/dxkit@latest demo loop-guardrail

This runs the real gate on a temporary fixture repo: baseline → introduce a net-new secret → BLOCK → repair → CLEAN, then it tears the fixture down. No API key and no Claude Code, and your own repo is never touched. It needs gitleaks installed and takes about 20 seconds; without gitleaks it shows a clearly labelled illustration instead. (It does a one-time npx download, so it is not fully offline, though the gate itself is.)

Presets: what blocks the loop

security-only  (default)  secrets and critical or high vulnerabilities. Bounded, must-fix, cheap to gate.
full-debt      (opt-in)   also gates test gaps and maintainability regressions. Repairs can be expensive.

The default is security-only. The headline escape-rate benchmark used full-debt (it gated both the secret trap and the test-gap trap); the default install starts narrower so a first run does not trap users in expensive test-generation loops. Switch with npm init @vyuhlabs/dxkit -- --claude-loop --loop-preset full-debt.

Give the agent a map, not just a gate

The Stop-gate controls what a loop is allowed to ship. The code graph controls how the agent does the work in between. When dxkit scaffolds a repo it builds a code graph and installs skills that drive real development off it, so the agent orients by querying structure instead of grepping and re-reading whole files.

Build a feature (dxkit-feature skill): query the graph for where the feature plugs in, what patterns already exist, and what the change will touch, then implement against those patterns and run the analyzers on the result before it stops.
Fix a finding (dxkit-action skill): take a flagged finding, pull its callers, callees, and blast radius from the graph, repair it, and confirm the change did not introduce something net-new.

The agent gets callers, callees, and blast radius up front as a budget-bounded slice, not a pile of file reads. It is the same graph, the same baseline, and the same identity contract the gate already uses.

What the benchmarks actually show is predictable spend, not guaranteed cheaper spend. On a large repo the median was roughly tied, the worst-case session used about 57% fewer tokens, and the variance was roughly halved. On a small repo the overhead was about zero. The graph caps the expensive tail. It does not promise a lower average, and it does not make the agent write better code on its own.

This is a different axis from detection. Snyk, SonarQube, and CodeQL tell you what is wrong. They do not give the agent a map of the code or bound how much it spends finding its way around. dxkit does both: the gate bounds what the loop ships, the graph bounds how the loop works.

The numbers

Three independent benchmark results, one theme: dxkit makes agent work more predictable.

Layer	What it bounds	Observed result
Stop-gate	net-new detector-backed debt	vanilla loops escaped 11/16 times, prompt-only checklist escaped 9/16, dxkit escaped 0/16
Deterministic identity	false "net-new" findings under churn	caught all 3 seeded regressions with 0/2 false blocks on clean edits; 0 false net-new on tested line shifts and renames
Graph context	large-repo exploration tails	median roughly tied, but large-repo mean tokens 30% lower, worst case 57% lower, variance roughly halved

Deferral has a re-orientation cost. A fourth arm of the loop-safety study measured the "detect on CI, fix later" model: on the test-gap task, deferring a net-new finding to a cold session cost ~49% more in equivalent cost and ~51% more turns than repairing it inside the warm loop, because the cold fixer has to re-orient in a context it no longer holds. (The secret-task premium pointed the same way but was weak (mean +19%, median slightly negative), so we lean on the robust test-gap result.) So the gate is not just safer than deferring, it is plausibly cheaper too.

And the gate is fast enough to run on every stop. dxkit 2.14.0 scopes the Stop-gate scan to the active preset's blockable finding kinds and re-scans only the changed files, reusing cached results for everything unchanged. The verdict is identical to a full scan; the cost is seconds per stop, not minutes, even on large repositories.

Benchmark caveats: the loop-safety study uses controlled synthetic tasks plus real-repo validation, detector-backed findings, and Sonnet runs. It is not a CVE corpus, not a claim of better detection, and not a guarantee that dxkit catches every possible bug. The claim is narrower: for findings the detector observes, dxkit gives the loop a deterministic net-new stop decision.

Full methodology, reproducibility notes, artifact status, and caveats are in docs/benchmarks.md.

Why not just Snyk, SonarQube, or CodeQL?

Use them. dxkit can ingest their findings. The difference is tempo and control, not detection. Cloud scanners are strong detection engines, and they usually run on a CI or PR cadence. A coding-agent loop needs a local stop decision every time the agent tries to declare done.

Loop Stop-gate need	dxkit	Cloud or CI scanners
Runs locally on every stop, in seconds	yes	usually CI or cloud cadence
Deterministic verdict, no model in the gate	yes	varies (some add an LLM judge)
Grandfathers existing debt	yes	tool-dependent
Feeds the exact block reason back to the warm agent session	yes	usually a human-facing dashboard or PR

The goal is not to replace scanners. It is to make their findings enforceable at the speed of the agent loop.

Beyond loops

The same deterministic core powers the rest of dxkit: pre-push and CI guardrails, brownfield baselines, durable finding identity, SARIF, CodeQL, and Snyk ingest, a six-dimension health report, code-graph context, and a set of Claude Code skills. See the docs.

Languages

dxkit covers 8 ecosystems. Detection is automatic from your manifests and source; each language brings its own native linter, dependency-audit tool, and coverage parser, layered on the universal scanners (gitleaks, Semgrep, OSV, cloc, jscpd, graphify).

Language	Detected by	Native linter + audit
TypeScript / JavaScript	`package.json`	ESLint, npm audit
Python	`pyproject.toml`, `*.py`	ruff, pip-audit
Go	`go.mod`	golangci-lint, govulncheck
Rust	`Cargo.toml`	clippy, cargo-audit
C# / .NET	`.csproj`, `.sln`	dotnet-format, `dotnet list --vulnerable`
Java	`pom.xml`, `src/main/java/`	PMD, osv-scanner
Kotlin	`.gradle{.kts,}`, `.kt`	detekt, osv-scanner
Ruby	`Gemfile`, `*.rb`	RuboCop, bundler-audit

Per-pack capabilities: coverage import, import-graph, severity tiers (click to expand)

Language	Detection	Coverage import	Import-graph	Native tools	Lint severity tiers	Vuln severity tiers
TS / JS	`package.json`	Istanbul	import/require/re-export	eslint, npm audit, vitest-coverage	ESLint rule ID	npm audit native
Python	`pyproject.toml`, `setup.py`, `*.py`	coverage.py	import/from	ruff, pip-audit, coverage	ruff code prefix	pip-audit + OSV.dev (CVSS v3+v4)
Go	`go.mod`	coverprofile	import blocks	golangci-lint, govulncheck	`FromLinter` family	govulncheck embedded + OSV.dev
Rust	`Cargo.toml`	lcov + cobertura	use statements, extracted only¹	clippy, cargo-audit, cargo-llvm-cov	clippy group	cargo-audit native
C#	`.csproj`, `.sln`	cobertura XML	using declarations, extracted only¹	dotnet-format (formatter)	format-only²	dotnet list --vulnerable
Kotlin	gradle/`.gradle{.kts,}`, `.kt`	JaCoCo XML	import statements, extracted only¹	detekt, osv-scanner (Maven)	detekt severity	osv-scanner + OSV.dev (Maven)
Java	`pom.xml`, `src/main/java/`, `*.java`	JaCoCo XML	import statements, extracted only¹	PMD, osv-scanner (Maven)	PMD priority tiers	osv-scanner + OSV.dev (Maven)
Ruby	`*.rb`	SimpleCov JSON	require/require_relative, extracted only¹	rubocop, bundler-audit, osv-scanner	rubocop severity	bundler-audit + osv-scanner (Gemfile.lock)

¹ Rust, C#, Kotlin, Java, and Ruby populate imports.extracted but the file-level resolver is a no-op. Downstream analyses that need an edge graph (reachability, import-graph test-gap credit) degrade to conservative defaults for those packs. Resolvers are tracked on the roadmap.

² C# uses dotnet-format for formatting violations only. A real severity-tiered C# linter (Roslyn analyzers or StyleCop) is on the roadmap. Today every C# formatting violation is counted at low tier so it does not inflate the Code Quality score.

Reproduce the deterministic tier

The deterministic results (the net-new gate decision and the finding-identity matcher) reproduce offline with no API key, so you do not have to trust our numbers. These harnesses live in benchmarks/:

node benchmarks/bench-guardrail.mjs config.json        # block/allow on seeded findings
node benchmarks/bench-netnew-isolation.mjs config.json # net-new isolation under churn
node benchmarks/bench-matcher.mjs config.json          # false net-new on line shifts + renames

See benchmarks/README.md to point them at a repo. The agent-driven harnesses (loop safety, cost of deferral, gate-vs-LLM, and the graph-context sessions) need a model subscription or API key and are published under benchmarks/agentic/. Full methodology, the per-study reports, caveats, and repro steps: docs/benchmarks.md.

Credits

dxkit stands on excellent open source tools. It orchestrates them, it does not replace them. Thank you to the maintainers of graphify (the code graph), gitleaks, Semgrep, OSV-Scanner, jscpd, and cloc. Each tool is installed separately and keeps its own license.

Contributing and roadmap

Contributing guide: CONTRIBUTING.md
Roadmap: docs/roadmap.md
License: MIT

Keywords

ai-coding-agents agentic-development claude-code codex guardrails brownfield baseline static-analysis code-quality security-scanning pre-commit-hooks ci-cd semgrep snyk codeql sarif dxkit vyuh