@forwardimpact/libharness

Version: 1.0.0 (published 2d ago)
Licence: Apache-2.0
Size: 407 kB

libharness

Agent evaluation framework — prove whether agent changes improved outcomes with reproducible evidence.

libharness provides the runtime and tool surface for multi-LLM coordination — an agent talks to a supervisor, a facilitator chairs a meeting, a lead drives an asynchronous discussion — plus a CLI suite that runs evals, queries the traces they produce, and edits skill files under controlled conditions.

CLIs

CLI	Purpose
`fit-harness`	Run agents in `run`/`supervise`/`facilitate`/`discuss` subcommands.
`fit-trace`	Download, query, and analyze NDJSON traces produced by `fit-harness`.
`fit-benchmark`	Run task families for N runs each and aggregate pass@k.
`fit-selfedit`	Write stdin to `.claude/**` paths, gated by settings.json + branch.

The supervise, facilitate, and discuss subcommands of fit-harness share one orchestration loop and one async tool surface, both described below. The judge role is a profile passed to supervise.

Modes

Mode	Lead	Participants	Terminal tool
`run`	(none)	one agent	task completion
`supervise`	`supervisor`	one `agent`	`Conclude`
`facilitate`	`facilitator`	N named	`Conclude`
`discuss`	`lead`	N named	`Adjourn` or `Recess`
`judge`	`judge`	(none)	`Conclude`

run and judge are one-shot. The other three share OrchestrationLoop plus an async Ask/Answer/Announce/RollCall tool surface; the loop fans messages out over an in-memory bus and emits a {source, seq, event} NDJSON envelope for every line.

Async Ask / Answer / Announce

Ask({ question, to? })       →  { askIds: [N, …] }
Answer({ message, askId? })  →  routed to the asker
Announce({ message })        →  broadcast, no reply expected

Every Ask returns immediately and registers a pending entry keyed by an askId. The reply arrives later in the asker's inbox as [answer#N] <participant>: <text>. To broadcast an Ask, omit the to field on a multi-participant lead. Answer's askId is optional, and the handler is forgiving:

Provided + matches an ask owed by the caller → routes to that asker.
Provided but unknown or wrong addressee → isError with a pointed message.
Omitted + exactly one ask owed to the caller → auto-picks it.
Omitted + 0 or many asks owed → broadcasts as Announce.
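
The four routing rules above can be sketched as a pure function. This is only an illustration under assumptions, not the handler libharness ships: resolveAnswer, the shape of the pending map, and the returned action objects are hypothetical names chosen for the example.

```javascript
// Hypothetical sketch of the Answer routing rules; not the library's code.
// `pending` maps askId -> { asker, addressee } for asks still awaiting a reply.
function resolveAnswer(pending, caller, askId) {
  if (askId !== undefined) {
    const ask = pending.get(askId);
    // Provided but unknown, or owed by someone else: report an error.
    if (!ask || ask.addressee !== caller) {
      return { kind: "error", message: `askId ${askId} is not owed by ${caller}` };
    }
    return { kind: "route", to: ask.asker, askId };
  }
  // Omitted: look at every ask the caller still owes.
  const owed = [...pending].filter(([, ask]) => ask.addressee === caller);
  if (owed.length === 1) {
    const [id, ask] = owed[0];
    return { kind: "route", to: ask.asker, askId: id };
  }
  // Zero or several owed asks: fall back to a broadcast.
  return { kind: "announce" };
}

const pending = new Map([
  [41, { asker: "facilitator", addressee: "agent-1" }],
  [42, { asker: "facilitator", addressee: "agent-2" }],
  [43, { asker: "agent-1", addressee: "agent-2" }],
]);

console.log(resolveAnswer(pending, "agent-1", 41).kind);        // route
console.log(resolveAnswer(pending, "agent-1", 42).kind);        // error
console.log(resolveAnswer(pending, "agent-1", undefined).askId); // 41
console.log(resolveAnswer(pending, "agent-2", undefined).kind);  // announce
```

Keeping the decision separate from delivery makes each of the four cases easy to check in isolation.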

Inbox lines on resume:

[ask#42]     facilitator: What is your current condition?
[answer#41]  agent-1:     We're at 7 out of 10.
[shared]     agent-2:     FYI I'm switching to Bun 1.2.
[system]     @orchestrator: You have an unanswered ask from facilitator (askId=42)…

Async means the lead can issue Asks, end its turn, and plan in the gap while participants work in parallel — nothing blocks the LLM thread.
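
One way to picture the non-blocking behaviour is a small in-memory bus with one queue per participant and a Promise that resolves when mail arrives. The sketch below is an assumption-laden illustration: MiniBus and its fields are hypothetical, and only the waitForMessages name is borrowed from the module map.

```javascript
// Hypothetical in-memory bus: one queue per participant plus a Promise wakeup.
class MiniBus {
  constructor(names) {
    this.queues = new Map(names.map((name) => [name, []]));
    this.waiters = new Map(); // name -> resolve function of a pending wait
  }

  // Queue a line for `to` and wake it if it is currently waiting.
  send(to, line) {
    this.queues.get(to).push(line);
    const wake = this.waiters.get(to);
    if (wake) {
      this.waiters.delete(to);
      wake();
    }
  }

  // Remove and return everything queued for `name`.
  drain(name) {
    return this.queues.get(name).splice(0);
  }

  // Resolve immediately if mail is waiting, otherwise on the next send().
  waitForMessages(name) {
    if (this.queues.get(name).length > 0) return Promise.resolve();
    return new Promise((resolve) => this.waiters.set(name, resolve));
  }
}

const bus = new MiniBus(["lead", "alice"]);
// alice starts waiting first; the later send() resolves her Promise.
bus.waitForMessages("alice").then(() => {
  console.log(bus.drain("alice")); // [ '[ask#1] lead: status?' ]
});
bus.send("alice", "[ask#1] lead: status?");
```

The sender never waits on the recipient: send() returns as soon as the line is queued, which is what lets a lead issue several Asks and end its turn.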

Discuss-mode replies

In discussion mode, Answer calls routed to the lead are streamed to the discussion thread as they are produced — each agent's Answer becomes a separate reply posted immediately, not batched at session end. The lead and agents can also call Acknowledge to post brief messages directly to the thread (status updates, human follow-up responses). The message bus intercepts answers and appends them to ctx.replies[].

RequestForComment is a separate coordination tool available on agent roles (facilitated agents and discuss agents). It queues an intent to open a new Discussion thread for long-horizon coordination on open questions; these are accumulated in ctx.rfcs[], separate from the thread replies in ctx.replies[].

Orchestration loop

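On each iteration a participant drains its bus queue (or waits for messages), then runs or resumes the LLM with the drained messages rendered as tagged lines. If an Ask owed by the participant is left unanswered, the loop injects one synthetic reminder; if the Ask is still unanswered after that, the loop emits protocol_violation and unblocks the asker with a synthetic null answer.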
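
The reminder-then-violation escalation can be sketched as a two-step function. This is a hypothetical illustration: escalate, the reminded flag, and the payload fields are assumptions made for the example; only the protocol_violation event name and the reminder wording come from this README.

```javascript
// Hypothetical escalation for one owed Ask; field names are assumptions.
function escalate(ask) {
  if (!ask.reminded) {
    ask.reminded = true; // first miss: nudge the addressee once
    return {
      action: "remind",
      line: `[system] @orchestrator: You have an unanswered ask from ${ask.asker} (askId=${ask.askId})`,
    };
  }
  // Second miss: record the violation and release the asker.
  return {
    action: "violate",
    event: { type: "protocol_violation", askId: ask.askId, participant: ask.addressee },
    answer: { to: ask.asker, askId: ask.askId, message: null },
  };
}

const ask = { askId: 42, asker: "facilitator", addressee: "agent-1", reminded: false };
console.log(escalate(ask).action); // remind
console.log(escalate(ask).action); // violate
```

The null answer matters as much as the violation event: without it the asker would keep a pending entry that can never resolve.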

Termination uses two flags. ctx.concluded is set by an explicit Conclude, Adjourn, or Recess; setting it also cancels in-flight Asks so that askers see why their question will not be answered. stopped is broader and also covers a lead error, an agent crash, and the abort path. Loops watch stopped; ctx.concluded only feeds the summary's success/verdict.

Tool surface, by role

Role	Ask	Answer	Announce	RollCall	Conclude	Other
Facilitator	✓	✓	✓	✓	✓
Fac. agent	✓	✓	✓	✓		`RequestForComment`
Supervisor	✓	✓	✓	✓	✓
Sup. agent	✓	✓	✓	✓
Discuss lead	✓	✓	✓	✓		`Recess`, `Adjourn`, `Acknowledge`
Discuss agt	✓	✓	✓	✓		`RequestForComment`, `Acknowledge`
Judge					✓

The to field of Ask accepts a participant name on multi-participant roles (facilitator, discuss lead, all participants). The supervise pair has only one possible target, so passing to is rejected there.

Minimal example: two-participant facilitator

import { createFacilitator, createRedactor } from "@forwardimpact/libharness";
import { query } from "@anthropic-ai/claude-agent-sdk";

const facilitator = createFacilitator({
  facilitatorCwd: process.cwd(),
  agentConfigs: [
    { name: "alice", role: "explorer", agentProfile: "alice" },
    { name: "bob",   role: "tester",   agentProfile: "bob" },
  ],
  query,
  output: process.stdout,
  redactor: createRedactor(),
  facilitatorProfile: "improvement-coach",
});

const result = await facilitator.run("Run a kata storyboard meeting.");
// result.success / result.turns / NDJSON trace on process.stdout

The facilitator gets Ask/Answer/Announce/RollCall/Conclude; each agent gets the same minus Conclude, plus RequestForComment. Every tool call, bus message, and orchestrator event becomes one trace line.

Trace format and redaction

Each line is { "source": "<participant|orchestrator>", "seq": N, "event": {…} }. seq is monotonic across the whole trace. The orchestrator emits session_start, agent_start, protocol_violation, lead_turn_limit, and summary. event is either the SDK event verbatim or the orchestrator payload. fit-trace consumes this format.
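
A trace in this envelope format can be read with a few lines of plain JavaScript. The sketch below is illustrative only and does not use fit-trace: parseTrace and countBySource are hypothetical helpers, the event payloads in the sample are invented, and treating seq as strictly increasing is an assumption about what "monotonic" means here.

```javascript
// Split NDJSON text into envelopes, skipping blank lines.
function parseTrace(ndjson) {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Count envelopes per source, failing if seq ever stalls or goes backwards
// (assumption: seq is strictly increasing).
function countBySource(envelopes) {
  const counts = {};
  let last = -Infinity;
  for (const { source, seq } of envelopes) {
    if (seq <= last) throw new Error(`seq ${seq} is out of order`);
    last = seq;
    counts[source] = (counts[source] ?? 0) + 1;
  }
  return counts;
}

// Invented sample lines; real event payloads will differ.
const sample = [
  '{"source":"orchestrator","seq":0,"event":{"type":"session_start"}}',
  '{"source":"alice","seq":1,"event":{"type":"assistant"}}',
  '{"source":"orchestrator","seq":2,"event":{"type":"summary"}}',
].join("\n");

console.log(countBySource(parseTrace(sample))); // { orchestrator: 2, alice: 1 }
```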

Redaction is on by default for fit-harness run/supervise/facilitate and composes two layers:

Env-var allowlist — ANTHROPIC_API_KEY, GH_TOKEN, GITHUB_TOKEN by default; override with LIBHARNESS_REDACTION_ENV_VARS=NAME1,… (replaces, not extends). Runtime values become [REDACTED:env:NAME] everywhere they appear.
Credential-shape patterns — sk-ant-, ghp_, ghs_, gho_, github_pat_. Hits become [REDACTED:pattern:KIND].

Set LIBHARNESS_REDACTION_DISABLED=1 to disable redaction (one stderr warning per run). Never disable it on CI for a public repo: workflow artifacts remain downloadable for the whole retention period.
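
The two layers can be sketched as follows. This is not the createRedactor that libharness exports: makeRedactor and allowlist are hypothetical, the KIND labels (anthropic, github) are invented, and the character classes after each prefix are guesses. The default variable names, the replace-not-extend override, and the marker formats come from the description above.

```javascript
const DEFAULT_NAMES = ["ANTHROPIC_API_KEY", "GH_TOKEN", "GITHUB_TOKEN"];

// Prefixes are from the README; what follows each prefix is an assumption.
const PATTERNS = {
  anthropic: /sk-ant-[A-Za-z0-9_-]+/g,
  github: /(?:ghp_|ghs_|gho_|github_pat_)[A-Za-z0-9_]+/g,
};

// The override replaces the default list rather than extending it.
function allowlist(env) {
  const override = env.LIBHARNESS_REDACTION_ENV_VARS;
  if (!override) return DEFAULT_NAMES;
  return override.split(",").map((name) => name.trim()).filter(Boolean);
}

function makeRedactor(env) {
  // Layer 1 input: runtime values of the allowlisted variables that are set.
  const secrets = allowlist(env)
    .filter((name) => env[name])
    .map((name) => [name, env[name]]);
  return (text) => {
    let out = text;
    // Layer 1: replace every literal occurrence of a known secret value.
    for (const [name, value] of secrets) {
      out = out.split(value).join(`[REDACTED:env:${name}]`);
    }
    // Layer 2: catch anything that still has a credential shape.
    for (const [kind, pattern] of Object.entries(PATTERNS)) {
      out = out.replace(pattern, `[REDACTED:pattern:${kind}]`);
    }
    return out;
  };
}

const redact = makeRedactor({ GH_TOKEN: "ghp_abc123" });
console.log(redact("push with ghp_abc123, fallback ghp_zzz9"));
// push with [REDACTED:env:GH_TOKEN], fallback [REDACTED:pattern:github]
```

Running the env-var layer first means a known secret is labelled by its variable name, while the pattern layer only catches credentials that were never in the allowlist.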

Module map

Module	Purpose
`agent-runner.js`	One Claude Agent SDK session; emits NDJSON via the redactor.
`message-bus.js`	Per-participant queues + `waitForMessages` Promise wakeup.
`orchestration-toolkit.js`	Shared Ask/Answer/Announce/Conclude/RollCall/RequestForComment handlers + builders.
`orchestration-loop.js`	Unified lead+participant loop; reminder/violation handling.
`facilitator.js` / `supervisor.js` / `discusser.js` / `judge.js`	Per-mode class + factory + system prompt.
`discuss-tools.js`	Discuss-only `Recess`/`Adjourn`/`Acknowledge`.
`reply-emitter.js`	Fire-and-forget POST of reply/ack events to the callback URL.
`inbox-poller.js`	Long-poll the bridge inbox for injected human messages.
`trace-collector.js` / `trace-query.js` / `trace-github.js`	Trace ingestion / querying / GitHub-attachment helpers.
`redaction.js`	Env-var allowlist + credential-shape pattern redaction.

fit-selfedit

A narrow, audited bypass for sessions where Edit/Write (and bash writes) are blocked against paths the project's own allowlist permits. It reads stdin, writes the target, and exits with 0 on success, 2 on a safeguard violation, or 1 on an I/O error.

echo "<content>" | bunx fit-selfedit <path>

Two safeguards, checked in order:

1. Settings-allow. Walk upward from the target with Finder.findUpward to find the nearest .claude/settings.json. The target path, taken relative to the settings file's grandparent directory (the directory that contains .claude), must match at least one Edit(<glob>) rule in permissions.allow[] (matched with minimatch, dot: true). settings.json is the single source of truth: widen the project allowlist and the CLI follows. Traversal like .claude/../README.md is rejected as a side effect, because path.resolve collapses .. first and the resolved path is then tested against the rules.
2. Branch scope. git rev-parse --abbrev-ref HEAD must not be HEAD (detached) or main. Edits ride a feature branch through whatever merge gates the project has configured.

Failure messages name the safeguard that rejected; safeguard 1 also lists the Edit() rules that were tried.
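
The order of the two checks, and the exit codes they map to, can be sketched like this. It is a hypothetical model, not fit-selfedit's source: checkSafeguards and its result shape are invented, path resolution is assumed to have happened already, and glob matching is injected so that the crude prefixMatch stand-in below can take the place of minimatch.

```javascript
// Hypothetical model of the two safeguards, checked in order.
// `relTarget` is the already-resolved target, relative to the directory that
// holds .claude; `matches(path, glob)` stands in for minimatch with dot: true.
function checkSafeguards({ relTarget, allow, branch, matches }) {
  // Safeguard 1: keep only Edit(<glob>) rules and test the target against them.
  const globs = allow
    .map((rule) => /^Edit\((.+)\)$/.exec(rule))
    .filter(Boolean)
    .map((match) => match[1]);
  if (!globs.some((glob) => matches(relTarget, glob))) {
    return { exit: 2, safeguard: "settings-allow", tried: globs };
  }
  // Safeguard 2: refuse a detached HEAD and the main branch.
  if (branch === "HEAD" || branch === "main") {
    return { exit: 2, safeguard: "branch-scope" };
  }
  return { exit: 0 };
}

// Crude stand-in matcher: supports only exact paths and a trailing "/**".
const prefixMatch = (path, glob) =>
  glob.endsWith("/**") ? path.startsWith(glob.slice(0, -2)) : path === glob;

const allow = ["Bash(git status)", "Edit(.claude/skills/**)"];
const check = (relTarget, branch) =>
  checkSafeguards({ relTarget, allow, branch, matches: prefixMatch });

console.log(check(".claude/skills/kata/SKILL.md", "feat/kata").exit); // 0
console.log(check("README.md", "feat/kata").safeguard);               // settings-allow
console.log(check(".claude/skills/kata/SKILL.md", "main").safeguard); // branch-scope
```

Returning the list of tried globs alongside the rejection mirrors the failure message described above, which names the safeguard and lists the Edit() rules that were checked.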

Documentation

Agent Evaluations Guide — how to run an eval and read its trace.
Agent Collaboration Guide — supervise / facilitate / discuss in depth.
Trace Analysis Guide — analysing NDJSON traces with fit-trace.

Keywords

eval agent trace claude-code supervisor