@jetmiky/diablo
Diablo
A central conductor that runs your skills through the Pi coding agent in isolated git worktrees, stopping at human gates and handing work off as git commits.
What it is
Diablo is not the brain — your skills are. Diablo is the conductor: it decides which model tier runs which skill, in which worktree, reading which inputs, then stops at which human gate.
- Skill-driven — your skills provide the procedures (grilling, PRD, issues, TDD, handoff, refactor). Diablo injects them into Pi via
@filereferences. - Central, not a swarm — one conductor dispatches Pi runs synchronously. No daemon, no message bus, no scheduler.
- Git as the event store — work transfers between steps as commits, so every step is durable and resumable.
- Human gates — three distinct ones, never conflated: intake (idea → spec, front of the pipeline), plan negotiation (spec → frozen plan, the middle gate), and per-commit approval (during a run). Interactive steps hand you the keyboard; AFK steps run headless.
Credits
Diablo conducts a set of engineering skills authored by Matt Pocock — github.com/mattpocock/skills. The skills (master-plan, tdd, grill-with-docs, to-prd, to-issues, handoff, and others) are the "brain" diablo orchestrates; diablo itself is only the conductor. All credit for the skill methodology belongs to the original author.
The orchestrated skills are vendored into this repo under skills/ so the engine and the skills it drives evolve in lockstep (the plan parser is a strict contract on the master-plan skill's output) and a fresh clone is self-contained. Those vendored copies are verbatim, unmodified copies of the upstream skills, not derivative configurations — updating from upstream is a clean re-copy, and what runs is exactly what upstream published. See skills/UPSTREAM.md for provenance and the vendored set.
Status
Early development. Building the core step-execution primitive first (sequential, single-issue), parallel multi-issue later.
Try it
New to diablo? docs/tutorial/ is a step-by-step,
follow-along guide that runs diablo end to end on a small toy project (a Roman
numeral library). It builds two features: a converter run directly with
diablo run (auto-plan), and arithmetic run through diablo plan → diablo run
(negotiate and freeze a plan first) — covering init, plan, implement, verify,
and integrate.
Requirements
diablo is a conductor — it does not contain a model. Every step it runs is a
Pi coding agent invocation, so Pi must
be installed and on your PATH, with its provider/credentials configured.
Node ≥ 22 (or Bun) is needed to run diablo itself.
How diablo finds the pi binary
diablo and Pi are each installed globally however you like — npm, bun, or pnpm — and each manager drops its global binaries in a different directory:
| Manager | Typical global bin dir |
|---|---|
| npm | $(npm prefix -g)/bin (e.g. /usr/local/bin, or an nvm node dir) |
| bun | ~/.bun/bin |
| pnpm | ~/.local/share/pnpm (or $PNPM_HOME) |
To work regardless of how Pi was installed, diablo spawns the agent by the bare
command name pi and lets the OS resolve it against your PATH (Node's
child_process.spawn of a slash-less command uses execvp, the same lookup a
shell does). So the only requirement is that which pi succeeds in the
environment you run diablo from — a global install via any manager satisfies
that.
DIABLO_PI_BIN — pointing at a non-PATH install
If your pi is not on PATH (a one-off build, a pinned version, a sandbox), set
an explicit absolute path and diablo uses it verbatim:
export DIABLO_PI_BIN="/opt/pi/bin/pi"
diablo run my-issueResolution order: DIABLO_PI_BIN (if set and non-empty) → otherwise the bare
name pi resolved via PATH.
Tip: if you hit
spawn pi ENOENT, diablo could not find Pi onPATH. Either add Pi's global bin dir toPATH(e.g. in~/.bashrc/~/.profile) or setDIABLO_PI_BINto its absolute path.
Model tiers
| Tier | Model | Thinking | Used for |
|---|---|---|---|
| planner-high | claude-opus-4.8 |
high | grilling, master plan |
| planner-med | claude-opus-4.8 |
medium | per-stage design (grounded in committed code), final verification |
| worker | claude-sonnet-4.5 |
medium | implementation |
| verifier | claude-sonnet-4.5 |
medium | per-stage verification |
Each implementation stage runs design → worker → verifier: a planner-med
design step reads the code prior stages actually committed and writes a short
design note (functions/types/files with signatures) that the worker implements
against. The frozen plan stays behavior-level; the per-stage design grounds the
interface in real code rather than guessing at plan time. The final
Verification stage escalates to the planner-med tier (a holistic,
whole-feature judgment), while mid-pipeline verifiers stay cheap on the verifier
tier.
Cross-tier review (cost/quality knob). By default the worker and the per-stage verifier are the same class (
claude-sonnet-4.5), so each stage is reviewed by a peer of the model that wrote it. The deterministic gate (see the done gate and ADR 0001) already makes typecheck/tests a measured fact, so what the verifier still judges is the softer "do the acceptance criteria hold" call. If you want stronger per-stage review and accept the extra cost on the most frequent step in a run, set the verifier a tier above the worker:diablo run <issue> --verifier-model claude-opus-4.8or persist it as
"models": { "verifier": "claude-opus-4.8" }indiablo.config.json. The default stays cheap on purpose (ADR 0003); this is an opt-in, not a recommendation for every run.
Configure
diablo init scaffolds a project for diablo with sensible defaults (non-interactive, AFK-friendly). It creates:
diablo.config.json— config scaffold (idempotent).gitignore— merges diablo managed blockAGENTS.md— agent guidance docCONTEXT.md— single-context domain doc.scratch/— local issue tracker directorydocs/agents/— triage labels, issue tracker, and domain conventions
Interactive mode
diablo init -i (or --interactive) runs the full interactive flow: a Pi skill-setup session, bootstrap confirmation, and package manager choice.
Customization flags
diablo init # all defaults
diablo init --claude # CLAUDE.md instead of AGENTS.md
diablo init --context multiple # per-scope context layout
diablo init --triage-labels "ready,done,wip" # custom triage labels
diablo init --no-triage-labels # skip triage scaffold
diablo init --package-manager bun # bootstrap with bun
diablo init --setup-skills # run Pi skill-setup after scaffold
diablo init -i # full interactive flowWhen you opt in to bootstrap (--bootstrap or --package-manager), diablo runs git init (if needed) plus installs husky/commitlint with the chosen manager. Choosing skip runs git init only and installs no Node tooling — the escape hatch for non-Node projects (Go, Rust, Python), since husky/commitlint require Node regardless of manager.
Config is optional — diablo runs with built-in defaults when no file is present.
{
"models": {
"planner": "claude-opus-4.8",
"worker": "claude-sonnet-4.5",
"verifier": "claude-sonnet-4.5",
},
"integration": {
"targetBranch": "main",
"branchPrefix": "diablo/",
"autoMerge": false,
},
"gate": "none",
"retry": { "limit": 2 },
"verify": { "commands": [] },
"limits": { "stepTimeoutMs": 1200000, "runBudgetMs": 14400000, "maxSteps": 200 },
}This block shows the built-in defaults — writing it out is the same as writing
{}. Every field is optional. A present key overrides only itself; everything else
keeps its built-in default. A malformed value (bad JSON, wrong type, unknown
enum) fails loudly at load time rather than silently reverting — so a typo can
never quietly change how a run behaves.
Telegram push (diablo telegram setup)
Run progress can be pushed to Telegram. Credentials are never stored in
diablo.config.json; they live in a per-repo, gitignored .diablo/telegram.json
written by an interactive setup command:
diablo telegram setup # prompts for the bot token and chat id, writes .diablo/telegram.jsonBecause the file lives under the machine-managed .diablo/ dir, it inherits that
dir's gitignore rule — the token can never be committed. At run time the
credentials resolve as env > file > disabled: the environment
(DIABLO_TELEGRAM_BOT_TOKEN / DIABLO_TELEGRAM_CHAT_ID) overrides the file per
field (handy for CI and one-off runs), the two sources are mixable, and Telegram
stays off unless both a bot token and a chat id resolve. One bot per project
keeps each repo pushing to its own chat with its own getUpdates slot.
Field reference
models — which model runs each tier
"models": {
"planner": "claude-opus-4.8", // grilling, master plan, per-stage design, final verify
"worker": "claude-sonnet-4.5", // implementation
"verifier": "claude-sonnet-4.5", // per-stage verification
}Each value is a model name only — diablo adds the provider (9router/kr)
and the per-tier thinking level (planner → high/medium, worker/verifier →
medium) at run time. You set what model; the tier table owns how hard it
thinks.
| Value | Implication |
|---|---|
| omitted (default) | planner claude-opus-4.8, worker & verifier claude-sonnet-4.5 — the cost/quality split the pipeline is tuned for: an expensive brain plans and judges, cheaper hands implement. |
a stronger worker (e.g. claude-opus-4.8) |
higher implementation quality, materially higher cost and latency on the step that runs most often. |
a cheaper planner/verifier (e.g. claude-haiku-4.5) |
faster, cheaper toy/scratch runs; weaker plans and shallower verdicts, so a bad plan or a missed regression is more likely to slip through. |
Precedence (each layer overrides the one before):
built-in defaults ← diablo.config.json ← CLI flag (--planner-model, ...)
So --worker-model claude-haiku-4.5 on a single diablo run beats the config
for that run only, without editing the file.
integration — what happens to the work branch after a passing run
"integration": {
"targetBranch": "main", // branch work is cut from, and merged back into
"branchPrefix": "diablo/", // work branch is <prefix><issue>
"autoMerge": false, // merge into targetBranch on PASS, or leave it
}| Field | Values | Implication |
|---|---|---|
targetBranch |
any branch name (default main) |
the work branch is cut FROM this and (if autoMerge) merged back INTO it. Point it at develop or a release branch to keep main untouched. |
branchPrefix |
any string (default diablo/) |
the work branch is <branchPrefix><issue>. Change it to namespace diablo's branches (e.g. bot/, ai/) for branch-protection or filtering. |
autoMerge: false |
default | on a final PASS the branch is left intact and diablo prints the exact git merge command. A passing verdict is not the same as "the human wants this in main" — you stay the gatekeeper of the trunk. |
autoMerge: true |
opt-in | a clean merge lands automatically in the primary working copy. A merge conflict is never auto-resolved: diablo aborts the merge cleanly, lists the conflicting files, and prints the manual command. |
retry — how many times a failed implementation re-tries before halting
"retry": { "limit": 2 }limit is the number of EXTRA worker attempts after the first, on a verifier
VERDICT: FAIL [implementation]. The failed verifier feedback is injected into
the re-run so it fixes the specific defect rather than blindly redoing the stage.
| Value | Implication |
|---|---|
0 |
fail-fast — the first implementation FAIL halts the stage to a human. Cheapest, least autonomous. |
2 (default) |
up to two self-corrections per stage before halting. Absorbs most "almost right" worker misses without supervision. |
| higher | more autonomy on flaky stages, but more spend on a stage that may be failing for a structural reason a retry can't fix. |
Note: a VERDICT: FAIL [plan] (the plan itself is wrong, not the code) always
halts immediately regardless of limit — diablo never auto-replans, because the
frozen plan is a hard contract. Retries only ever re-run the worker.
gate — human approval checkpoint
"gate": "none" // or "approval"Controls whether diablo run / diablo refactor pause for a human y/N during
an otherwise-autonomous run. The pause fires after a stage's work is committed
AND has passed verification — so you're approving a verified result, not a raw
mid-flight diff. Decline (anything not starting with y, including a bare Enter)
halts the run cleanly: the committed work stays on the worktree branch and the
pipeline stops with a clear message — a human halt, not a failure.
| Value | Implication |
|---|---|
"none" |
default — fully AFK. No step ever pauses; the run goes from plan to final verdict to integration without asking. This is diablo's autonomous-conductor identity. |
"approval" |
a y/N checkpoint after every verifying step: each per-stage verifier (once the stage passes) and the final whole-feature verification. You review each verified chunk and decide whether to proceed to the next stage. |
What is not gated, even under "approval": the design and worker steps. The
worker runs unattended (it's explicitly told not to ask for approval, since
there's no human in its loop), and the retry loop self-corrects an implementation
FAIL before the gate is ever consulted — so you're only asked once a stage has
genuinely passed. Note this is orthogonal to integration.autoMerge: gate
checkpoints between stages during a run; autoMerge decides what happens to
the branch after the whole run passes. The PRD-approval prompt inside diablo intake is a separate checkpoint and is unaffected by this field.
verify — the deterministic gate diablo runs itself
"verify": { "commands": ["bun run typecheck", "bun test"] }The commands diablo runs itself in the worktree after a verifying step, so a
stage's verdict is a MEASURED fact, not the verifier model's self-report. Each
command runs in order; a non-zero exit fails the stage regardless of what the
verifier LLM claimed — a green VERDICT: PASS over a failing bun test is still
a FAIL. A measured failure is treated as [implementation] and flows into the
normal worker-retry loop; a genuine LLM FAIL [plan] still halts to a human.
| Value | Implication |
|---|---|
[] (default) |
no deterministic gate — verification is the verifier LLM's verdict alone. diablo prints a loud warning at the start of each run so you know the gate isn't measured. |
["bun test", ...] |
diablo runs each command in the worktree; the measured exit codes have teeth. This is what makes "safe to run AFK" hold even if the model misreports. |
Set these to your project's real check commands (bun run typecheck, bun test,
npm test, cargo test, a wrapper script, …). Commands are split on whitespace;
wrap anything needing shell features in a script and point a command at it.
limits — safety ceilings for an unattended run
"limits": {
"stepTimeoutMs": 1200000, // 20 min — a single step is killed past this
"runBudgetMs": 14400000, // 4 h — the whole run aborts past this
"maxSteps": 200, // circuit breaker on total agent steps
}Generous defaults that exist to stop a pathological hang or runaway, never to
clip a legitimately long run. A step exceeding stepTimeoutMs is killed (the
underlying process is terminated) and the run halts to a human; a run exceeding
runBudgetMs or maxSteps aborts cleanly, preserving committed work and
reporting which limit tripped. Lower them for tight CI-style runs; raise them for
genuinely large features on slow models.
skillsDir — override the vendored skills location
"skillsDir": "/abs/path/to/skills"Omitted (the default), diablo resolves the skills/ directory vendored into its
own package by walking up from its module location — never your project's cwd,
so a fresh clone is self-contained and diablo run works from any directory. Set
this only if you deliberately want to point the engine at a different skills set
(e.g. a local fork while debugging the plan parser). Pointing it at skills whose
master-plan output doesn't match diablo's plan parser will break plan loading —
this is the escape hatch the memory note calls "fork only when a hard contract
forces it."
Branch integration
Each run does its work on <branchPrefix><issue> (default diablo/<issue>),
cut from targetBranch. After a final PASS:
autoMerge: false(default) — the branch is left intact and diablo prints the exactgit mergecommand. A passing verdict is not the same as "the human wants this in main."autoMerge: true+ a clean merge — the branch is merged intotargetBranchin the primary working copy.- A merge conflict — diablo aborts the merge cleanly, lists the conflicting files, and prints the manual command. Conflicts are never auto-resolved.
Intake
diablo intake <feature> runs the requirement-gathering phase IN FRONT of
run. Unlike run (autonomous, AFK), intake is interactive and Socratic — it
cannot be AFK — so it is a separate command:
- grill-with-docs — an interactive session that gathers requirements,
adapting to the project: brownfield reads existing code +
CONTEXT.md, greenfield starts from an empty glossary. - state-machine modeling (optional) — for stateful features, an interactive
domain-modelingsession enumerates states/transitions/guards/events and writes astate-machine.mdartifact the PRD step then incorporates. You're asked up front; declining skips it cleanly so simple features aren't burdened. - to-prd — authors a PRD from the gathered requirements (and the state machine, when modeled).
- human approval checkpoint — you approve the PRD before it is decomposed; declining stops cleanly with the PRD saved and no issues written.
- to-issues — decomposes the approved PRD into tracked issues under
.scratch/<feature>/, whichdiablo runthen picks up.
Plan negotiation
diablo plan <issue> is the middle gate — between a tracked issue (the spec) and
an AFK run (the implementation). The most expensive, least-reversible part of
the pipeline is a full multi-stage build; planning is cheap to discuss. So the
plan becomes a negotiated artifact you shape WITH the planner before it is
frozen, rather than something run writes silently and immediately executes.
- Propose + self-grill — the planner writes a proposed staged plan and, in its reply, summarizes the approach, what it deliberately is NOT doing, and the risks/assumptions/open-questions it surfaces itself. You aren't the sole critic.
- Negotiate — you challenge; the planner defends or revises. A challenge is treated as a hypothesis to evaluate, not an order to obey: if it's technically wrong or would damage the design, the planner says so and explains why, citing the spec or the code. It revises only when a challenge exposes a real gap. No reflexive agreement (the same anti-sycophancy posture the verifier verdict has).
- Freeze — only on an explicit
approve(type the bare word). The planner rewrites a clean frozen plan plus a## Decisions & rationalesection distilling the why from the negotiation, and the issue status becomesplanned.abortcancels without freezing.
The negotiation is a single resumed Pi session (the planner accumulates the full
exchange), foreground and interactive — this is NOT the AFK run.
How run relates to a plan:
- A frozen plan (status
planned) —runexecutes it. - A draft plan (a plan file exists but was never approved) —
runrejects it. This is where the gate gets its teeth: the expensive build never proceeds on a plan you didn't approve. - No plan at all —
runauto-plans non-interactively then executes (the full-AFK escape hatch for trivial issues you want to fire-and-forget).
diablo plan invoked with no issue opens an interactive selector of the issues
discovered under .scratch/, each tagged with its status. The same selector
backs a bare diablo run. In a non-interactive context (backgrounded, piped) the
selector cannot run, so it fails fast asking for an explicit issue name rather
than hanging — so an AFK run never blocks on input that will never arrive.
Issue status (lifecycle) vs triage label
Two orthogonal axes track an issue — never conflate them:
| Triage label | Issue status (lifecycle) | |
|---|---|---|
| Values | needs-triage, needs-info, ready-for-agent, ready-for-human, wontfix |
open, planned, in-progress, needs-human, done |
| Means | should a human/agent pick this up? | where is it in the run pipeline? |
| Set by | a human, by hand | diablo, automatically |
| Lives in | the Status: line inside the issue .md (the vendored Matt Pocock skill's format, kept verbatim) |
.diablo/<issue>/state.json (gitignored runtime state) |
Keeping the lifecycle in state.json rather than the .md is deliberate: it
never clobbers the human triage label. Diablo sets the lifecycle status as it
runs (plan → planned, run start → in-progress, the done gate → done or
needs-human), and the selector reads it to badge each issue. Because
integration.autoMerge defaults off, an issue can be done yet unmerged —
shown as done (unmerged), and the condition diablo plan warns about on a
later issue.
Concurrent runs of the same issue
diablo run <issue> takes a per-issue lock before touching the worktree, so two
overlapping runs of the same issue can't race into the same branch and corrupt
each other's commits and progress state. The lock is a small file under the
gitignored .diablo/<issue>/run.lock (never committed), recording the owning
pid and start time.
- A second run while one is live fails fast with a non-zero exit and a clear
message (
issue <X> is already being run (pid …, started …)) — and never mutates the worktree. - The lock is released on every exit path: completion, a clean halt
(
needs-human), or a crash. - Staleness is decided by liveness, not a timeout: if the owning process is gone (a crashed run), the next run detects the dead pid and reclaims the lock automatically — a crash never permanently wedges an issue. A corrupt lockfile is likewise treated as no lock.
- The lock is per-issue: different issues run concurrently, unaffected by each other's locks.
Reclaiming a worktree: diablo clean
Every run leaves an isolated worktree at .worktrees/<issue>/ and a
diablo/<issue> branch behind. That's deliberate — run is resume-aware and
reuses them, so nothing is ever removed automatically (ADR 0002). When
you're actually done with an issue, reclaim its space explicitly:
diablo clean <issue> # remove the worktree + delete the branch
diablo clean <issue> --keep-branch # remove only the worktree
diablo clean <issue> --force # remove even if the branch isn't mergedThe safety guard is the point: without --force, clean refuses to remove a
worktree whose branch is not merged into the target branch — so you can't lose
unmerged work by accident. With --force it skips the guard and force-deletes
the branch. If the worktree is already gone it's a harmless no-op (idempotent).
Because nothing auto-deletes, a halted run is always resumable exactly as before.
The done gate
A run finishing PASS is necessary but not sufficient — the issue's acceptance
criteria must actually be proven, not assumed. The final (planner-tier)
verification maps each acceptance criterion to concrete evidence (a test name,
code path, or command output) as a CRITERIA: checklist. Then, mechanically:
- every criterion checked AND
VERDICT: PASS→ statusdone(and the issue file's acceptance boxes are flipped to[x]); - any criterion unproven, or the verifier didn't address them all → status
needs-human, listing exactly what is unmet. Never a silent done.
An issue with no acceptance criteria falls back to a weak gate (PASS alone suffices) with a warning — done means proven, but a ticket that specified nothing to prove can't be held to criteria it doesn't have.
Run vs refactor
diablo run <issue> and diablo refactor <area> share ONE engine — the same
design → worker → verifier → final-verify pipeline and integration. They differ
only in the planner skill injected:
| Command | Planner skill | Produces |
|---|---|---|
diablo run <issue> |
master-plan |
an implementation plan from a ticket |
diablo refactor <area> |
improve-codebase-architecture |
a refactor plan for an area |
Refactor is human-initiated, never auto-detected — deciding "this is large enough
to refactor" is a human judgment. A refactor plan can surface new issues, which
flow back through to-issues → diablo run. Same engine, looped.
Progress
A run emits structured progress events through a ProgressPort to three sinks:
- stdout — a one-line status per discrete event (always on). While a step is in flight, an animated braille spinner with an elapsed timer is redrawn in place (carriage return, no newline), so a long step never looks hung.
progress.md— a LIVE tracker in the worktree's.plans/, updated every discrete event with per-stage status (TODO/IN_PROGRESS/DONE/HALTED), commit SHA, verdict, retries, and a Pending Todos list. Each stage's handoff note (the worker's carry-forward narrative: decisions, deferrals, gotchas) is folded into the same artifact — one file, no drift. Liveness heartbeats are ignored here (the tracker is structural, not a per-second sink).- Telegram — push notifications rendered as Telegram-HTML (the supported
<b>/<i>/<code>/<pre>/<a>subset, escaped for path/SHA/code-heavy content). Liveness is shown as a single live bubble edited in place and throttled to one edit per 15s (respecting the Bot API edit rate limit); a discrete event closes the bubble so the next heartbeat opens a fresh one. Enabled when a bot token and chat id both resolve, from either the environment (DIABLO_TELEGRAM_BOT_TOKEN/DIABLO_TELEGRAM_CHAT_ID) or the per-repo.diablo/telegram.jsonfile written bydiablo telegram setup— env wins per field, the two sources are mixable, and a partial config leaves Telegram off. No credentials are read fromdiablo.config.jsonor committed (the file is gitignored via the existing.diablo/rule).
Liveness
Long agent steps are otherwise silent for minutes. A heartbeat ticks while
each step runs, so every surface can show the run is alive and for how long.
Diablo streams Pi's --mode json stdout line by line and surfaces each
tool_execution_start as a short activity label, so the heartbeat reflects what
the agent is doing right now (for example editing run-step.ts, running `bun test` , or searching for "TODO") instead of a bare working. The label
falls back to the bare tool name for tools it doesn't recognise, and to
working before the first tool starts.
Liveness is push-only and best-effort: a failing sink (e.g. Telegram down) never
halts a run, and the activity stream is parsed defensively (a malformed line is
ignored, never thrown). Idle-vs-working is derived from the event stream
(waiting-for-approval = idle).
Two-way interactive approval over Telegram is intentionally not built:
unattended runs use gate: none (the verifier is the automated checkpoint), and
interactive runs already approve over stdin (gate: approval) from the terminal.
A Telegram approval gate would sit only in the narrow "AFK but still approving
by hand" gap, at the cost of an inbound poller, pairing/security, and
timeout/stale-tap handling — not worth it for that slice.
Develop
bun install
bun test
bun run typecheck