canary-lab 1.4.1 (CLI) · License: MIT · Size: 8.9 MB

Install scripts: this package runs scripts during installation (preinstall/install/postinstall).

Canary Lab

Your AI agent fixes the code. Canary Lab proves it works.

Canary Lab is the independent harness on your machine: it boots your real services in dev mode, runs your Playwright tests, and keeps the evidence (logs, traces, screenshots, videos). Your agent reads the failure, fixes the code, and signals a rerun — Canary Lab keeps going until it's green. The harness runs the tests and writes the record, so a green run means it actually passed.

Figure: Canary Lab end-to-end. An AI agent scaffolds a Checkout test suite, checks requirement coverage (47%), authors more tests to reach 100%, runs the suite green (12/12), and exports a verified evaluation report.

Contents

What's New
How the Repair Loop Works
What You Write
Why a Harness? Your Agent Already Has a Terminal
Canary Lab and docker-compose
How It Compares
Quick Start
Agent-First Workflow
What Canary Lab Owns
Requirements
Limitations
Documentation
License

How the Repair Loop Works

1. Canary Lab applies the selected envset and starts your local services.
2. Playwright runs the feature tests.
3. Logs, screenshots, traces, videos, summaries, and failure slices land under logs/runs/<runId>/.
4. Your AI agent reads the failure context, fixes the app or the test, and signals rerun or restart. Canary Lab, not the agent, reruns the tests.
5. Canary Lab continues from the same run until the check passes.
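
The steps above can be read as a small control loop. The sketch below is illustrative only and is not Canary Lab's implementation: `repairLoop`, `startServices`, `runTests`, and `waitForSignal` are hypothetical names, and the `maxAttempts` cap is an assumption added so the example always terminates. Only the `rerun` and `restart` signal names come from the description above.

```javascript
// Hypothetical sketch of the repair loop. None of these names are Canary Lab APIs.
function repairLoop({ startServices, runTests, waitForSignal, maxAttempts = 10 }) {
  const history = [];
  startServices(); // step 1: apply the envset and boot services

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Steps 2-3: the harness (not the agent) runs the tests and records the result.
    const result = runTests();
    history.push({ attempt, passed: result.passed });
    if (result.passed) return { status: 'green', history };

    // Step 4: the agent reads the failure, edits code, then signals what to do next.
    const signal = waitForSignal(result);
    if (signal === 'restart') {
      startServices(); // restart: reboot services before the next test pass
    } else if (signal !== 'rerun') {
      return { status: 'stopped', history }; // unknown signal: stop instead of guessing
    }
    // Step 5: continue within the same run.
  }
  return { status: 'gave-up', history };
}

module.exports = { repairLoop };
```

The point of the shape is that the agent only influences the loop through `waitForSignal`; the verdict always comes from `runTests`, which the harness owns.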

What You Write

A feature is a folder with two things: a config for booting your services, and normal Playwright tests — no new test language to learn.

The config is where per-run isolation comes from: you describe the dev command you already run, and Canary Lab assigns a free port per run.

```js
// features/checkout/feature.config.cjs
const config = {
  name: 'checkout',
  envs: ['local'],
  repos: [{
    name: 'checkout',
    localPath: __dirname,
    startCommands: [{
      command: 'npm run dev',
      // Canary Lab allocates a free port per run and injects it as PORT, so two
      // runs of this service never collide. Reference it anywhere as ${port.api}.
      ports: [{ name: 'api', env: 'PORT' }],
      healthCheck: { http: { url: 'http://localhost:${port.api}/', timeoutMs: 3000 } },
    }],
  }],
  featureDir: __dirname,
}

module.exports = { config }
```

The tests are ordinary Playwright. The only Canary Lab-specific line is the import — a thin fixture that tags each test's output so failures map back to the right test:

```ts
// features/checkout/e2e/checkout.spec.ts
import { test, expect } from 'canary-lab/feature-support/log-marker-fixture'

test('applying SAVE10 produces a 10% discount on the summary', async ({ request }) => {
  const { orderId } = await (await request.post('/order')).json()
  await request.post(`/order/${orderId}/items`, { data: { sku: 'X', qty: 1, price: 100 } })
  await request.post(`/order/${orderId}/coupon`, { data: { code: 'SAVE10' } })
  const summary = await (await request.get(`/order/${orderId}/summary`)).json()
  expect(summary.discount).toBe(10)
})
```

The scaffold ships sample features (some intentionally broken) so you can watch a full repair loop before writing your own.

Why a Harness? Your Agent Already Has a Terminal

A good coding agent can start a dev server and run Playwright itself. Three things it can't do alone:

Concurrency without conflicts. Every run gets its own ports (filled into commands, health checks, and env files via ${port.api}) and a git worktree per shared repo; extra runs wait in a queue. Several agents share one laptop safely.
Results it can't fake. The agent only reads results and asks for a rerun — the harness runs the tests and owns the verdict.
Safe env switching. Env files are backed up before changes and restored when the run ends.
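
The queueing behaviour in the first point can be pictured with a small admission queue. This is a sketch under assumptions, not Canary Lab's scheduler: the `RunQueue` class and its concurrency cap are hypothetical, and the README does not say how the real limit is chosen.

```javascript
// Hypothetical admission queue: at most `maxConcurrent` runs are active,
// later submissions wait in FIFO order until a slot frees up.
class RunQueue {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = new Set();
    this.waiting = [];
  }

  // Returns 'started' if the run got a slot, otherwise 'queued'.
  submit(runId) {
    if (this.active.size < this.maxConcurrent) {
      this.active.add(runId);
      return 'started';
    }
    this.waiting.push(runId);
    return 'queued';
  }

  // Frees the slot held by `runId` and promotes the next waiting run, if any.
  // Returns the promoted run id, or null when nothing was promoted.
  finish(runId) {
    if (!this.active.delete(runId)) return null;
    const next = this.waiting.shift();
    if (next === undefined) return null;
    this.active.add(next);
    return next;
  }
}

module.exports = { RunQueue };
```

With a cap of two, a third agent's run is queued and starts only when one of the first two finishes, which is what keeps several agents from competing for the same machine.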

Canary Lab and docker-compose

They work together. Compose runs services as images, so a one-line fix waits on a rebuild — Compose Watch helps, but only once you maintain a dev image and watch rules per service. Canary Lab runs the dev commands you already use (npm run dev, ./gradlew bootRun): hot reload picks up the fix in seconds, no Dockerfile, and per-run ports let several runs share one machine.

Compose is still better for databases and queues (Postgres, Redis, Kafka) and CI. Use both: docker compose up postgres redis in a Canary Lab startCommand for infrastructure, Canary Lab for your app services in dev mode.

How It Compares

Canary Lab targets one niche: an AI agent repairing local, multi-service e2e tests. Here is how it sits next to the alternatives:

| Capability | Plain Playwright | docker-compose (watch) | Hosted test dashboard | Canary Lab |
| --- | --- | --- | --- | --- |
| Runs your existing dev commands, hot reload intact | ✓ | needs a dev image + watch rules | — | ✓ |
| Fix → retest in seconds, no image rebuild | ✓ (one service) | after rebuild/sync | — | ✓ |
| Boots & orchestrates several services together | you script it | ✓ | varies | ✓ |
| Concurrent runs on one machine (auto ports + git worktrees) | manual | not out of the box | hosted, not local | ✓ |
| Per-run evidence the agent can't fake | — | — | ✓ (in the cloud) | ✓ (on your machine) |
| Env-file switching with backup/restore | manual | manual | — | ✓ |
| Runs fully local / offline | ✓ | ✓ | — | ✓ |

Canary Lab earns its place when a failure depends on more than a browser assertion — which services were up, which env was active, what the backend logged — and you want an agent to fix it unattended. Skip it when npx playwright test already tells you enough, when you want self-healing locators, when you don't need service orchestration or env switching, or when you'd rather a hosted dashboard manage your tests.

Quick Start

Create a workspace and open the local UI:

```sh
npx canary-lab init my-lab
cd my-lab
npx canary-lab ui
```

init scaffolds a workspace with sample features, installs dependencies, downloads the Playwright browser, and registers your AI agent's tools — so canary-lab ui opens the UI at http://localhost:7421 straight away. Add --no-open to skip the browser.

Prefer to install yourself (CI / offline)? Pass --no-install to init, then run the steps manually:

```sh
npx canary-lab init my-lab --no-install
cd my-lab
npm install
npm run install:browsers
npx canary-lab ui
```

The UI and MCP server share one port (default 7421). Pin another with --port 8200 on init, or change it later in Project Settings — Canary Lab restarts on the new port, and your MCP client may need to reconnect.

Restart your AI agent after setup so it discovers the Canary Lab tools. If they don't appear, run npx canary-lab setup --force and start a fresh agent session.

Agent-First Workflow

Canary Lab is built for an agent to drive: an MCP client writes tests, starts or claims runs, reads failure context, fixes code, and signals the next action. Canary Lab stays the run monitor; diagnosis and edits happen in the agent.

From your workspace, just ask:

/canary-lab run checkout locally, fix it if it fails, and run it again until it passes

You can also start runs by hand from the UI, and custom clients can drive the local HTTP API.

What Canary Lab Owns

Canary Lab keeps a narrow boundary: no test language, assertion model, or browser runner — Playwright runs the tests. Canary Lab owns the context around them:

Feature scaffolding and conventions; envset apply/cleanup.
Service startup, health checks, PTY streams, shutdown — with per-run port and git-worktree isolation for concurrent runs.
Run manifests, logs, artifacts, failure slices, summaries, and diagnosis journals.
Rerun/restart signals after a fix.

Requirements

Node.js >= 20 and npm >= 9.
A modern browser: Chrome, Firefox, or Safari.
Local UI server on http://localhost:7421 (the default; set per project via --port or Project Settings), with service orchestration through node-pty.
Optional repair agents: supported AI agent CLIs (claude, codex) on PATH.

node-pty is a native module that gives each service a real terminal (PTY), so interactive dev servers behave as they do in your own shell. It ships prebuilt binaries, so a normal install compiles nothing. The one postinstall step (fix-node-pty-permissions.mjs) re-adds the execute bit to node-pty's spawn-helper, which its tarball drops on some platforms (an upstream packaging bug). The step is a chmod scoped to node_modules/node-pty, and a silent no-op on Windows or if node-pty isn't installed.

Limitations

Repairs are only as good as your service logs.
Envset runs overwrite target files while active. If the process is killed mid-backup or mid-restore, reopen the UI and use the envset controls to recover.
Envset values aren't validated — stale config can surface as unclear test failures.
The Linux and Windows workflows aren't polished yet.
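
The envset limitation is easier to reason about with the backup/restore pattern in front of you. The sketch below is a generic illustration, not Canary Lab's envset code: `withEnvset` and the `.bak` suffix are hypothetical. It shows why a hard kill matters: the restore sits in a `finally` block, which runs on normal exit and on thrown errors but never runs if the process is killed outright.

```javascript
const fs = require('node:fs');

// Run `fn` with `targetFile` temporarily replaced by `envsetFile`,
// restoring the original afterwards even if `fn` throws.
function withEnvset(targetFile, envsetFile, fn) {
  const backupFile = `${targetFile}.bak`; // hypothetical backup location
  const hadOriginal = fs.existsSync(targetFile);
  if (hadOriginal) fs.copyFileSync(targetFile, backupFile);
  try {
    fs.copyFileSync(envsetFile, targetFile); // overwrite while the run is active
    return fn();
  } finally {
    // Not reached if the process is killed; recovery then has to start from the backup.
    if (hadOriginal) fs.renameSync(backupFile, targetFile);
    else fs.rmSync(targetFile, { force: true });
  }
}

module.exports = { withEnvset };
```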

Documentation

| Doc | What's inside |
| --- | --- |
| Changelog | What changed in each release. |
| Guide | Environment switching, run-output layout, repairing a failed run, evaluation reports, and external authoring. |
| Commands | Full CLI reference for every `canary-lab` subcommand. |
| Feature Folders | Feature structure, scaffold conventions, and creating a feature. |
| Architecture | Module map, run lifecycle, concurrency, heal system, and the MCP layer. |
| Contributing | Code orientation and the build/test workflow. |

License

MIT