
@jetmiky/diablo

Licence: MIT
Version: 0.1.0
Deps: 0
Size: 355 kB
Vulns: 0
Weekly: 0

Diablo

A central conductor that runs your skills through the Pi coding agent in isolated git worktrees, stopping at human gates and handing work off as git commits.

What it is

Diablo is not the brain — your skills are. Diablo is the conductor: it decides which model tier runs which skill, in which worktree, reading which inputs, then stops at which human gate.

  • Skill-driven — your skills provide the procedures (grilling, PRD, issues, TDD, handoff, refactor). Diablo injects them into Pi via @file references.
  • Central, not a swarm — one conductor dispatches Pi runs synchronously. No daemon, no message bus, no scheduler.
  • Git as the event store — work transfers between steps as commits, so every step is durable and resumable.
  • Human gates — three distinct ones, never conflated: intake (idea → spec, front of the pipeline), plan negotiation (spec → frozen plan, the middle gate), and per-commit approval (during a run). Interactive steps hand you the keyboard; AFK steps run headless.

Credits

Diablo conducts a set of engineering skills authored by Matt Pocock (github.com/mattpocock/skills). The skills (master-plan, tdd, grill-with-docs, to-prd, to-issues, handoff, and others) are the "brain" diablo orchestrates; diablo itself is only the conductor. All credit for the skill methodology belongs to the original author.

The orchestrated skills are vendored into this repo under skills/ so the engine and the skills it drives evolve in lockstep (the plan parser is a strict contract on the master-plan skill's output) and a fresh clone is self-contained. Those vendored copies are verbatim, unmodified copies of the upstream skills, not derivative configurations — updating from upstream is a clean re-copy, and what runs is exactly what upstream published. See skills/UPSTREAM.md for provenance and the vendored set.

Status

Early development. Building the core step-execution primitive first (sequential, single-issue), parallel multi-issue later.

Try it

New to diablo? docs/tutorial/ is a step-by-step, follow-along guide that runs diablo end to end on a small toy project (a Roman numeral library). It builds two features: a converter run directly with diablo run (auto-plan), and arithmetic run through diablo plan → diablo run (negotiate and freeze a plan first). Together they cover init, plan, implement, verify, and integrate.

Requirements

diablo is a conductor — it does not contain a model. Every step it runs is a Pi coding agent invocation, so Pi must be installed and on your PATH, with its provider/credentials configured. Node ≥ 22 (or Bun) is needed to run diablo itself.

How diablo finds the pi binary

diablo and Pi are each installed globally however you like — npm, bun, or pnpm — and each manager drops its global binaries in a different directory:

Manager | Typical global bin dir
npm | $(npm prefix -g)/bin (e.g. /usr/local/bin, or an nvm node dir)
bun | ~/.bun/bin
pnpm | ~/.local/share/pnpm (or $PNPM_HOME)

To work regardless of how Pi was installed, diablo spawns the agent by the bare command name pi and lets the OS resolve it against your PATH (Node's child_process.spawn of a slash-less command uses execvp, the same lookup a shell does). So the only requirement is that which pi succeeds in the environment you run diablo from — a global install via any manager satisfies that.

DIABLO_PI_BIN — pointing at a non-PATH install

If your pi is not on PATH (a one-off build, a pinned version, a sandbox), set an explicit absolute path and diablo uses it verbatim:

export DIABLO_PI_BIN="/opt/pi/bin/pi"
diablo run my-issue

Resolution order: DIABLO_PI_BIN (if set and non-empty) → otherwise the bare name pi resolved via PATH.

Tip: if you hit spawn pi ENOENT, diablo could not find Pi on PATH. Either add Pi's global bin dir to PATH (e.g. in ~/.bashrc / ~/.profile) or set DIABLO_PI_BIN to its absolute path.
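
The resolution order above can be sketched as a small pure function. This is a minimal illustration, not diablo's actual code: the name resolvePiBin and the Env type are hypothetical.

```typescript
// Hypothetical helper: illustrates the documented resolution order only.
type Env = Record<string, string | undefined>;

function resolvePiBin(env: Env): string {
  const explicit = env["DIABLO_PI_BIN"];
  // Set and non-empty: use the path verbatim.
  if (explicit !== undefined && explicit !== "") return explicit;
  // Otherwise the bare command name, left for the OS to resolve against PATH.
  return "pi";
}
```

Returning the bare name rather than searching PATH by hand keeps the lookup identical to what a shell would do.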

Model tiers

Tier | Model | Thinking | Used for
planner-high | claude-opus-4.8 | high | grilling, master plan
planner-med | claude-opus-4.8 | medium | per-stage design (grounded in committed code), final verification
worker | claude-sonnet-4.5 | medium | implementation
verifier | claude-sonnet-4.5 | medium | per-stage verification

Each implementation stage runs design → worker → verifier: a planner-med design step reads the code prior stages actually committed and writes a short design note (functions/types/files with signatures) that the worker implements against. The frozen plan stays behavior-level; the per-stage design grounds the interface in real code rather than guessing at plan time. The final Verification stage escalates to the planner-med tier (a holistic, whole-feature judgment), while mid-pipeline verifiers stay cheap on the verifier tier.

Cross-tier review (cost/quality knob). By default the worker and the per-stage verifier are the same class (claude-sonnet-4.5), so each stage is reviewed by a peer of the model that wrote it. The deterministic gate (see the done gate and ADR 0001) already makes typecheck/tests a measured fact, so what the verifier still judges is the softer "do the acceptance criteria hold" call. If you want stronger per-stage review and accept the extra cost on the most frequent step in a run, set the verifier a tier above the worker:

diablo run <issue> --verifier-model claude-opus-4.8

or persist it as "models": { "verifier": "claude-opus-4.8" } in diablo.config.json. The default stays cheap on purpose (ADR 0003); this is an opt-in, not a recommendation for every run.

Configure

diablo init scaffolds a project for diablo with sensible defaults (non-interactive, AFK-friendly). It creates:

  • diablo.config.json: config scaffold (idempotent)
  • .gitignore: merges diablo's managed block
  • AGENTS.md: agent guidance doc
  • CONTEXT.md: single-context domain doc
  • .scratch/: local issue tracker directory
  • docs/agents/: triage labels, issue tracker, and domain conventions

Interactive mode

diablo init -i (or --interactive) runs the full interactive flow: a Pi skill-setup session, bootstrap confirmation, and package manager choice.

Customization flags

diablo init                                  # all defaults
diablo init --claude                         # CLAUDE.md instead of AGENTS.md
diablo init --context multiple               # per-scope context layout
diablo init --triage-labels "ready,done,wip" # custom triage labels
diablo init --no-triage-labels               # skip triage scaffold
diablo init --package-manager bun            # bootstrap with bun
diablo init --setup-skills                   # run Pi skill-setup after scaffold
diablo init -i                               # full interactive flow

When you opt in to bootstrap (--bootstrap or --package-manager), diablo runs git init (if needed) plus installs husky/commitlint with the chosen manager. Choosing skip runs git init only and installs no Node tooling — the escape hatch for non-Node projects (Go, Rust, Python), since husky/commitlint require Node regardless of manager. Config is optional — diablo runs with built-in defaults when no file is present.

{
  "models": {
    "planner": "claude-opus-4.8",
    "worker": "claude-sonnet-4.5",
    "verifier": "claude-sonnet-4.5"
  },
  "integration": {
    "targetBranch": "main",
    "branchPrefix": "diablo/",
    "autoMerge": false
  },
  "gate": "none",
  "retry": { "limit": 2 },
  "verify": { "commands": [] },
  "limits": { "stepTimeoutMs": 1200000, "runBudgetMs": 14400000, "maxSteps": 200 }
}

This block shows the built-in defaults — writing it out is the same as writing {}. Every field is optional. A present key overrides only itself; everything else keeps its built-in default. A malformed value (bad JSON, wrong type, unknown enum) fails loudly at load time rather than silently reverting — so a typo can never quietly change how a run behaves.
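
As a rough sketch of "a present key overrides only itself" and "fails loudly", the loader below handles just two fields (gate and retry.limit). The function name loadConfig, the error messages, and the exact validation rules are assumptions for illustration; diablo's real loader may differ.

```typescript
// Illustrative only: covers two fields; names and messages are hypothetical.
type Gate = "none" | "approval";
interface Config { gate: Gate; retry: { limit: number } }

const DEFAULTS: Config = { gate: "none", retry: { limit: 2 } };

function loadConfig(raw: string): Config {
  const parsed: unknown = JSON.parse(raw); // bad JSON throws here
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("config must be a JSON object");
  }
  const input = parsed as { gate?: unknown; retry?: unknown };
  // Start from the built-in defaults; each present key overrides only itself.
  const config: Config = { gate: DEFAULTS.gate, retry: { ...DEFAULTS.retry } };

  const gate = input.gate;
  if (gate === "none" || gate === "approval") {
    config.gate = gate;
  } else if (gate !== undefined) {
    throw new Error(`unknown gate value: ${JSON.stringify(gate)}`);
  }

  const retry = input.retry;
  if (retry !== undefined) {
    if (typeof retry !== "object" || retry === null) {
      throw new Error("retry must be an object");
    }
    const limit = (retry as { limit?: unknown }).limit;
    if (limit !== undefined) {
      if (typeof limit !== "number") throw new Error("retry.limit must be a number");
      config.retry.limit = limit;
    }
  }
  return config;
}
```

With this shape, loadConfig("{}") yields the defaults, while a typo such as "gate": "aproval" throws instead of quietly running with "none".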

Telegram push (diablo telegram setup)

Run progress can be pushed to Telegram. Credentials are never stored in diablo.config.json; they live in a per-repo, gitignored .diablo/telegram.json written by an interactive setup command:

diablo telegram setup   # prompts for the bot token and chat id, writes .diablo/telegram.json

Because the file lives under the machine-managed .diablo/ dir, it inherits that dir's gitignore rule — the token can never be committed. At run time the credentials resolve as env > file > disabled: the environment (DIABLO_TELEGRAM_BOT_TOKEN / DIABLO_TELEGRAM_CHAT_ID) overrides the file per field (handy for CI and one-off runs), the two sources are mixable, and Telegram stays off unless both a bot token and a chat id resolve. One bot per project keeps each repo pushing to its own chat with its own getUpdates slot.
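
The env > file > disabled rule can be sketched as follows. The two environment variable names come from this README; the file field names (botToken, chatId), the function name, and treating an empty environment value as unset are assumptions.

```typescript
// Sketch of per-field resolution: env > file > disabled.
// The file shape { botToken, chatId } is assumed, not documented.
type Env = Record<string, string | undefined>;
interface TelegramCreds { botToken: string; chatId: string }

function resolveTelegram(
  env: Env,
  file: Partial<TelegramCreds> | null,
): TelegramCreds | null {
  // Each field resolves independently, so the two sources can be mixed.
  const botToken = env["DIABLO_TELEGRAM_BOT_TOKEN"] || file?.botToken;
  const chatId = env["DIABLO_TELEGRAM_CHAT_ID"] || file?.chatId;
  // Telegram stays off unless BOTH values resolve.
  if (!botToken || !chatId) return null;
  return { botToken, chatId };
}
```

For example, a CI job could supply only DIABLO_TELEGRAM_CHAT_ID to redirect pushes to another chat while the token still comes from the file.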

Field reference
models — which model runs each tier
"models": {
  "planner": "claude-opus-4.8",    // grilling, master plan, per-stage design, final verify
  "worker": "claude-sonnet-4.5",   // implementation
  "verifier": "claude-sonnet-4.5", // per-stage verification
}

Each value is a model name only — diablo adds the provider (9router/kr) and the per-tier thinking level (planner → high/medium, worker/verifier → medium) at run time. You set what model; the tier table owns how hard it thinks.

Value | Implication
omitted (default) | planner claude-opus-4.8, worker & verifier claude-sonnet-4.5: the cost/quality split the pipeline is tuned for, where an expensive brain plans and judges and cheaper hands implement.
a stronger worker (e.g. claude-opus-4.8) | higher implementation quality, materially higher cost and latency on the step that runs most often.
a cheaper planner/verifier (e.g. claude-haiku-4.5) | faster, cheaper toy/scratch runs; weaker plans and shallower verdicts, so a bad plan or a missed regression is more likely to slip through.

Precedence (each layer overrides the one before):

built-in defaults  ←  diablo.config.json  ←  CLI flag (--planner-model, ...)

So --worker-model claude-haiku-4.5 on a single diablo run beats the config for that run only, without editing the file.
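
The three-layer precedence can be sketched as a fold over the layers, from lowest to highest priority. The function and type names here are hypothetical; only the defaults and the ordering come from this README.

```typescript
// Sketch: built-in defaults <- diablo.config.json <- CLI flags.
interface Models { planner: string; worker: string; verifier: string }

const BUILT_IN: Models = {
  planner: "claude-opus-4.8",
  worker: "claude-sonnet-4.5",
  verifier: "claude-sonnet-4.5",
};

const TIERS = ["planner", "worker", "verifier"] as const;

function resolveModels(fileModels: Partial<Models>, cliModels: Partial<Models>): Models {
  const resolved: Models = { ...BUILT_IN };
  // Later layers override earlier ones, one key at a time.
  for (const layer of [fileModels, cliModels]) {
    for (const tier of TIERS) {
      const value = layer[tier];
      if (value !== undefined) resolved[tier] = value;
    }
  }
  return resolved;
}
```

Because the CLI layer is applied last and never written back, a flag affects only the run it is passed to.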

integration — what happens to the work branch after a passing run
"integration": {
  "targetBranch": "main",      // branch work is cut from, and merged back into
  "branchPrefix": "diablo/",   // work branch is <prefix><issue>
  "autoMerge": false,          // merge into targetBranch on PASS, or leave it
}

Field | Values | Implication
targetBranch | any branch name (default main) | the work branch is cut FROM this and (if autoMerge) merged back INTO it. Point it at develop or a release branch to keep main untouched.
branchPrefix | any string (default diablo/) | the work branch is <branchPrefix><issue>. Change it to namespace diablo's branches (e.g. bot/, ai/) for branch-protection or filtering.
autoMerge: false | default | on a final PASS the branch is left intact and diablo prints the exact git merge command. A passing verdict is not the same as "the human wants this in main"; you stay the gatekeeper of the trunk.
autoMerge: true | opt-in | a clean merge lands automatically in the primary working copy. A merge conflict is never auto-resolved: diablo aborts the merge cleanly, lists the conflicting files, and prints the manual command.
retry — how many times a failed implementation re-tries before halting
"retry": { "limit": 2 }

limit is the number of EXTRA worker attempts after the first, on a verifier VERDICT: FAIL [implementation]. The failed verifier feedback is injected into the re-run so it fixes the specific defect rather than blindly redoing the stage.

Value | Implication
0 | fail-fast: the first implementation FAIL halts the stage to a human. Cheapest, least autonomous.
2 (default) | up to two self-corrections per stage before halting. Absorbs most "almost right" worker misses without supervision.
higher | more autonomy on flaky stages, but more spend on a stage that may be failing for a structural reason a retry can't fix.

Note: a VERDICT: FAIL [plan] (the plan itself is wrong, not the code) always halts immediately regardless of limit — diablo never auto-replans, because the frozen plan is a hard contract. Retries only ever re-run the worker.
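
The retry semantics can be sketched as a loop of 1 + limit attempts. This is a simplified, synchronous model: runStage, the Attempt shape, and the injected runWorkerAndVerify callback are hypothetical stand-ins for the real worker and verifier steps.

```typescript
// Simplified model of the documented retry rules; names are hypothetical.
type Verdict = "PASS" | "FAIL [implementation]" | "FAIL [plan]";
interface Attempt { verdict: Verdict; feedback?: string }
interface StageResult { outcome: "passed" | "halted"; attempts: number }

function runStage(
  limit: number,
  runWorkerAndVerify: (previousFeedback: string | undefined) => Attempt,
): StageResult {
  let feedback: string | undefined;
  // One initial attempt plus up to `limit` extra attempts.
  for (let attempt = 1; attempt <= 1 + limit; attempt++) {
    const result = runWorkerAndVerify(feedback);
    if (result.verdict === "PASS") return { outcome: "passed", attempts: attempt };
    // A plan-level failure halts immediately, regardless of limit.
    if (result.verdict === "FAIL [plan]") return { outcome: "halted", attempts: attempt };
    // Implementation failure: carry the verifier feedback into the re-run.
    feedback = result.feedback;
  }
  return { outcome: "halted", attempts: 1 + limit };
}
```

With limit 0 the loop body runs exactly once, which matches the fail-fast row in the table above.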

gate — human approval checkpoint
"gate": "none"   // or "approval"

Controls whether diablo run / diablo refactor pause for a human y/N during an otherwise-autonomous run. The pause fires after a stage's work is committed AND has passed verification — so you're approving a verified result, not a raw mid-flight diff. Decline (anything not starting with y, including a bare Enter) halts the run cleanly: the committed work stays on the worktree branch and the pipeline stops with a clear message — a human halt, not a failure.

Value | Implication
"none" | default: fully AFK. No step ever pauses; the run goes from plan to final verdict to integration without asking. This is diablo's autonomous-conductor identity.
"approval" | a y/N checkpoint after every verifying step: each per-stage verifier (once the stage passes) and the final whole-feature verification. You review each verified chunk and decide whether to proceed to the next stage.

What is not gated, even under "approval": the design and worker steps. The worker runs unattended (it's explicitly told not to ask for approval, since there's no human in its loop), and the retry loop self-corrects an implementation FAIL before the gate is ever consulted — so you're only asked once a stage has genuinely passed. Note this is orthogonal to integration.autoMerge: gate checkpoints between stages during a run; autoMerge decides what happens to the branch after the whole run passes. The PRD-approval prompt inside diablo intake is a separate checkpoint and is unaffected by this field.

verify — the deterministic gate diablo runs itself
"verify": { "commands": ["bun run typecheck", "bun test"] }

The commands diablo runs itself in the worktree after a verifying step, so a stage's verdict is a MEASURED fact, not the verifier model's self-report. Each command runs in order; a non-zero exit fails the stage regardless of what the verifier LLM claimed — a green VERDICT: PASS over a failing bun test is still a FAIL. A measured failure is treated as [implementation] and flows into the normal worker-retry loop; a genuine LLM FAIL [plan] still halts to a human.

Value | Implication
[] (default) | no deterministic gate: verification is the verifier LLM's verdict alone. diablo prints a loud warning at the start of each run so you know the gate isn't measured.
["bun test", ...] | diablo runs each command in the worktree; the measured exit codes have teeth. This is what makes "safe to run AFK" hold even if the model misreports.

Set these to your project's real check commands (bun run typecheck, bun test, npm test, cargo test, a wrapper script, …). Commands are split on whitespace; wrap anything needing shell features in a script and point a command at it.
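
A minimal sketch of how a measured gate can override the verifier's self-report. The process runner is injected as a hypothetical exec callback (argv in, exit code out) so the logic stays testable; stopping at the first failing command is an assumption, since the README only says that a non-zero exit fails the stage.

```typescript
// Sketch only: `exec` is a hypothetical injected runner returning an exit code.
type Exec = (argv: string[]) => number;
type StageVerdict = "PASS" | "FAIL";

function measuredVerdict(
  commands: string[],
  llmVerdict: StageVerdict,
  exec: Exec,
): StageVerdict {
  for (const command of commands) {
    // Commands are split on whitespace: no quoting, pipes, or other shell features.
    const argv = command.trim().split(/\s+/);
    // A non-zero exit fails the stage no matter what the verifier claimed.
    if (exec(argv) !== 0) return "FAIL";
  }
  // With no commands configured, the LLM verdict stands alone.
  return llmVerdict;
}
```

The whitespace split is why a command such as one containing && or quoted arguments belongs in a wrapper script.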

limits — safety ceilings for an unattended run
"limits": {
  "stepTimeoutMs": 1200000,   // 20 min — a single step is killed past this
  "runBudgetMs": 14400000,    // 4 h — the whole run aborts past this
  "maxSteps": 200,            // circuit breaker on total agent steps
}

Generous defaults that exist to stop a pathological hang or runaway, never to clip a legitimately long run. A step exceeding stepTimeoutMs is killed (the underlying process is terminated) and the run halts to a human; a run exceeding runBudgetMs or maxSteps aborts cleanly, preserving committed work and reporting which limit tripped. Lower them for tight CI-style runs; raise them for genuinely large features on slow models.
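
The two run-level ceilings can be sketched as a check made between steps. The names trippedLimit and RunProgress are hypothetical, and stepTimeoutMs is left out because it is enforced by killing the step's process rather than by a check like this.

```typescript
// Sketch of the run-level circuit breakers; names are hypothetical.
interface RunLimits { runBudgetMs: number; maxSteps: number }
interface RunProgress { elapsedMs: number; stepsTaken: number }

function trippedLimit(progress: RunProgress, limits: RunLimits): keyof RunLimits | null {
  // "Exceeding" is read strictly: hitting the ceiling exactly is still allowed.
  if (progress.elapsedMs > limits.runBudgetMs) return "runBudgetMs";
  if (progress.stepsTaken > limits.maxSteps) return "maxSteps";
  return null;
}
```

Returning the name of the limit lets the abort message report which ceiling tripped.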

skillsDir — override the vendored skills location
"skillsDir": "/abs/path/to/skills"

Omitted (the default), diablo resolves the skills/ directory vendored into its own package by walking up from its module location — never your project's cwd, so a fresh clone is self-contained and diablo run works from any directory. Set this only if you deliberately want to point the engine at a different skills set (e.g. a local fork while debugging the plan parser). Pointing it at skills whose master-plan output doesn't match diablo's plan parser will break plan loading — this is the escape hatch the memory note calls "fork only when a hard contract forces it."

Branch integration

Each run does its work on <branchPrefix><issue> (default diablo/<issue>), cut from targetBranch. After a final PASS:

  • autoMerge: false (default) — the branch is left intact and diablo prints the exact git merge command. A passing verdict is not the same as "the human wants this in main."
  • autoMerge: true + a clean merge — the branch is merged into targetBranch in the primary working copy.
  • A merge conflict — diablo aborts the merge cleanly, lists the conflicting files, and prints the manual command. Conflicts are never auto-resolved.

Intake

diablo intake <feature> runs the requirement-gathering phase IN FRONT of run. Unlike run (autonomous, AFK), intake is interactive and Socratic — it cannot be AFK — so it is a separate command:

  1. grill-with-docs — an interactive session that gathers requirements, adapting to the project: brownfield reads existing code + CONTEXT.md, greenfield starts from an empty glossary.
  2. state-machine modeling (optional) — for stateful features, an interactive domain-modeling session enumerates states/transitions/guards/events and writes a state-machine.md artifact the PRD step then incorporates. You're asked up front; declining skips it cleanly so simple features aren't burdened.
  3. to-prd — authors a PRD from the gathered requirements (and the state machine, when modeled).
  4. human approval checkpoint — you approve the PRD before it is decomposed; declining stops cleanly with the PRD saved and no issues written.
  5. to-issues — decomposes the approved PRD into tracked issues under .scratch/<feature>/, which diablo run then picks up.

Plan negotiation

diablo plan <issue> is the middle gate — between a tracked issue (the spec) and an AFK run (the implementation). The most expensive, least-reversible part of the pipeline is a full multi-stage build; planning is cheap to discuss. So the plan becomes a negotiated artifact you shape WITH the planner before it is frozen, rather than something run writes silently and immediately executes.

  1. Propose + self-grill — the planner writes a proposed staged plan and, in its reply, summarizes the approach, what it deliberately is NOT doing, and the risks/assumptions/open-questions it surfaces itself. You aren't the sole critic.
  2. Negotiate — you challenge; the planner defends or revises. A challenge is treated as a hypothesis to evaluate, not an order to obey: if it's technically wrong or would damage the design, the planner says so and explains why, citing the spec or the code. It revises only when a challenge exposes a real gap. No reflexive agreement (the same anti-sycophancy posture the verifier verdict has).
  3. Freeze — only on an explicit approve (type the bare word). The planner rewrites a clean frozen plan plus a ## Decisions & rationale section distilling the why from the negotiation, and the issue status becomes planned. abort cancels without freezing.

The negotiation is a single resumed Pi session (the planner accumulates the full exchange), foreground and interactive — this is NOT the AFK run.

How run relates to a plan:

  • A frozen plan (status planned) — run executes it.
  • A draft plan (a plan file exists but was never approved) — run rejects it. This is where the gate gets its teeth: the expensive build never proceeds on a plan you didn't approve.
  • No plan at all: run auto-plans non-interactively, then executes (the full-AFK escape hatch for trivial issues you want to fire-and-forget).
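
The three cases above amount to a small decision table. The type and function names below are hypothetical and only restate the rules in code form.

```typescript
// Sketch of how `run` treats each plan state; names are hypothetical.
type PlanState = "frozen" | "draft" | "none";
type RunAction = "execute" | "reject" | "auto-plan-then-execute";

function decideRun(plan: PlanState): RunAction {
  switch (plan) {
    case "frozen":
      return "execute"; // status planned: approved by a human
    case "draft":
      return "reject"; // a plan file exists but was never approved
    case "none":
      return "auto-plan-then-execute"; // full-AFK escape hatch
  }
}
```

The asymmetry is deliberate: having no plan is treated as consent to auto-plan, while an unapproved draft is treated as an open negotiation.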

diablo plan invoked with no issue opens an interactive selector of the issues discovered under .scratch/, each tagged with its status. The same selector backs a bare diablo run. In a non-interactive context (backgrounded, piped) the selector cannot run, so it fails fast asking for an explicit issue name rather than hanging — so an AFK run never blocks on input that will never arrive.

Issue status (lifecycle) vs triage label

Two orthogonal axes track an issue — never conflate them:

 | Triage label | Issue status (lifecycle)
Values | needs-triage, needs-info, ready-for-agent, ready-for-human, wontfix | open, planned, in-progress, needs-human, done
Means | should a human/agent pick this up? | where is it in the run pipeline?
Set by | a human, by hand | diablo, automatically
Lives in | the Status: line inside the issue .md (the vendored Matt Pocock skill's format, kept verbatim) | .diablo/<issue>/state.json (gitignored runtime state)

Keeping the lifecycle in state.json rather than the .md is deliberate: it never clobbers the human triage label. Diablo sets the lifecycle status as it runs (plan → planned, run start → in-progress, the done gate → done or needs-human), and the selector reads it to badge each issue. Because integration.autoMerge defaults off, an issue can be done yet unmerged. It is then shown as done (unmerged), which is the condition diablo plan warns about on a later issue.

Concurrent runs of the same issue

diablo run <issue> takes a per-issue lock before touching the worktree, so two overlapping runs of the same issue can't race into the same branch and corrupt each other's commits and progress state. The lock is a small file under the gitignored .diablo/<issue>/run.lock (never committed), recording the owning pid and start time.

  • A second run while one is live fails fast with a non-zero exit and a clear message (issue <X> is already being run (pid …, started …)) — and never mutates the worktree.
  • The lock is released on every exit path: completion, a clean halt (needs-human), or a crash.
  • Staleness is decided by liveness, not a timeout: if the owning process is gone (a crashed run), the next run detects the dead pid and reclaims the lock automatically — a crash never permanently wedges an issue. A corrupt lockfile is likewise treated as no lock.
  • The lock is per-issue: different issues run concurrently, unaffected by each other's locks.
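
The liveness-based staleness rule can be sketched as a pure decision over the lockfile contents. The JSON shape ({ pid, startedAt }), the function names, and the injected isAlive callback are assumptions; the README only says the file records the owning pid and start time.

```typescript
// Sketch: decide whether a run may take the per-issue lock.
// The lockfile format and all names here are assumed for illustration.
interface LockInfo { pid: number; startedAt: string }
type LockDecision =
  | { action: "acquire" }
  | { action: "refuse"; message: string };

function parseLock(raw: string): LockInfo | null {
  try {
    const value: unknown = JSON.parse(raw);
    if (typeof value !== "object" || value === null) return null;
    const { pid, startedAt } = value as { pid?: unknown; startedAt?: unknown };
    if (typeof pid !== "number" || typeof startedAt !== "string") return null;
    return { pid, startedAt };
  } catch {
    return null; // a corrupt lockfile is treated as no lock
  }
}

function decideLock(
  issue: string,
  raw: string | null, // lockfile contents, or null if the file is absent
  isAlive: (pid: number) => boolean,
): LockDecision {
  const lock = raw === null ? null : parseLock(raw);
  // No lock, a corrupt lock, or a dead owner: the lock can be (re)claimed.
  if (lock === null || !isAlive(lock.pid)) return { action: "acquire" };
  return {
    action: "refuse",
    message: `issue ${issue} is already being run (pid ${lock.pid}, started ${lock.startedAt})`,
  };
}
```

In Node, one common way to implement isAlive is process.kill(pid, 0) inside a try/catch, which tests for the process without sending it a signal.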

Reclaiming a worktree: diablo clean

Every run leaves an isolated worktree at .worktrees/<issue>/ and a diablo/<issue> branch behind. That's deliberate — run is resume-aware and reuses them, so nothing is ever removed automatically (ADR 0002). When you're actually done with an issue, reclaim its space explicitly:

diablo clean <issue>              # remove the worktree + delete the branch
diablo clean <issue> --keep-branch  # remove only the worktree
diablo clean <issue> --force        # remove even if the branch isn't merged

The safety guard is the point: without --force, clean refuses to remove a worktree whose branch is not merged into the target branch — so you can't lose unmerged work by accident. With --force it skips the guard and force-deletes the branch. If the worktree is already gone it's a harmless no-op (idempotent). Because nothing auto-deletes, a halted run is always resumable exactly as before.

The done gate

A run finishing PASS is necessary but not sufficient — the issue's acceptance criteria must actually be proven, not assumed. The final (planner-tier) verification maps each acceptance criterion to concrete evidence (a test name, code path, or command output) as a CRITERIA: checklist. Then, mechanically:

  • every criterion checked AND VERDICT: PASS → status done (and the issue file's acceptance boxes are flipped to [x]);
  • any criterion unproven, or the verifier didn't address them all → status needs-human, listing exactly what is unmet. Never a silent done.

An issue with no acceptance criteria falls back to a weak gate (PASS alone suffices) with a warning — done means proven, but a ticket that specified nothing to prove can't be held to criteria it doesn't have.
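
The mechanical part of the done gate can be sketched as below. The Criterion shape and function name are hypothetical; a criterion the verifier did not address is modelled as proven: false, and mapping a FAIL verdict to needs-human is an assumption consistent with the lifecycle described earlier.

```typescript
// Sketch of the done gate; names and the FAIL handling are assumptions.
type FinalVerdict = "PASS" | "FAIL";
interface Criterion { text: string; proven: boolean }
type GateResult =
  | { status: "done"; weakGate: boolean }
  | { status: "needs-human"; unmet: string[] };

function doneGate(verdict: FinalVerdict, criteria: Criterion[]): GateResult {
  const unmet = criteria.filter((c) => !c.proven).map((c) => c.text);
  if (verdict === "PASS" && unmet.length === 0) {
    // No criteria at all: PASS alone suffices, flagged so a warning can be shown.
    return { status: "done", weakGate: criteria.length === 0 };
  }
  return { status: "needs-human", unmet };
}
```

The result carries the unmet list so the halt message can name exactly what is missing rather than reporting a bare failure.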

Run vs refactor

diablo run <issue> and diablo refactor <area> share ONE engine — the same design → worker → verifier → final-verify pipeline and integration. They differ only in the planner skill injected:

Command | Planner skill | Produces
diablo run <issue> | master-plan | an implementation plan from a ticket
diablo refactor <area> | improve-codebase-architecture | a refactor plan for an area

Refactor is human-initiated, never auto-detected: deciding "this is large enough to refactor" is a human judgment. A refactor plan can surface new issues, which flow back through to-issues → diablo run. Same engine, looped.

Progress

A run emits structured progress events through a ProgressPort to three sinks:

  • stdout — a one-line status per discrete event (always on). While a step is in flight, an animated braille spinner with an elapsed timer is redrawn in place (carriage return, no newline), so a long step never looks hung.
  • progress.md — a LIVE tracker in the worktree's .plans/, updated every discrete event with per-stage status (TODO/IN_PROGRESS/DONE/HALTED), commit SHA, verdict, retries, and a Pending Todos list. Each stage's handoff note (the worker's carry-forward narrative: decisions, deferrals, gotchas) is folded into the same artifact — one file, no drift. Liveness heartbeats are ignored here (the tracker is structural, not a per-second sink).
  • Telegram — push notifications rendered as Telegram-HTML (the supported <b>/<i>/<code>/<pre>/<a> subset, escaped for path/SHA/code-heavy content). Liveness is shown as a single live bubble edited in place and throttled to one edit per 15s (respecting the Bot API edit rate limit); a discrete event closes the bubble so the next heartbeat opens a fresh one. Enabled when a bot token and chat id both resolve, from either the environment (DIABLO_TELEGRAM_BOT_TOKEN / DIABLO_TELEGRAM_CHAT_ID) or the per-repo .diablo/telegram.json file written by diablo telegram setup — env wins per field, the two sources are mixable, and a partial config leaves Telegram off. No credentials are read from diablo.config.json or committed (the file is gitignored via the existing .diablo/ rule).
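
The Telegram liveness throttle described above can be sketched as a tiny state machine. The 15-second interval comes from this README; the Bubble shape and function names are hypothetical, and the actual Bot API calls are left out.

```typescript
// Sketch of the live-bubble throttle; names are hypothetical, no network calls.
const MIN_EDIT_INTERVAL_MS = 15_000;

interface Bubble { lastEditAtMs: number | null }

// Returns true when a heartbeat should open or edit the live bubble now.
function onHeartbeat(bubble: Bubble, nowMs: number): boolean {
  const due =
    bubble.lastEditAtMs === null ||
    nowMs - bubble.lastEditAtMs >= MIN_EDIT_INTERVAL_MS;
  if (due) bubble.lastEditAtMs = nowMs;
  return due;
}

// A discrete event closes the bubble, so the next heartbeat opens a fresh one.
function onDiscreteEvent(bubble: Bubble): void {
  bubble.lastEditAtMs = null;
}
```

Heartbeats that arrive inside the window are simply dropped, which is acceptable because liveness is best-effort.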
Liveness

Long agent steps are otherwise silent for minutes. A heartbeat ticks while each step runs, so every surface can show the run is alive and for how long. Diablo streams Pi's --mode json stdout line by line and surfaces each tool_execution_start as a short activity label, so the heartbeat reflects what the agent is doing right now (for example editing run-step.ts, running bun test, or searching for "TODO") instead of a bare working. The label falls back to the bare tool name for tools it doesn't recognise, and to working before the first tool starts.

Liveness is push-only and best-effort: a failing sink (e.g. Telegram down) never halts a run, and the activity stream is parsed defensively (a malformed line is ignored, never thrown). Idle-vs-working is derived from the event stream (waiting-for-approval = idle).

Two-way interactive approval over Telegram is intentionally not built: unattended runs use gate: none (the verifier is the automated checkpoint), and interactive runs already approve over stdin (gate: approval) from the terminal. A Telegram approval gate would sit only in the narrow "AFK but still approving by hand" gap, at the cost of an inbound poller, pairing/security, and timeout/stale-tap handling — not worth it for that slice.

Develop

bun install
bun test
bun run typecheck
