libharness
Agent evaluation framework — prove whether agent changes improved outcomes with reproducible evidence.
libharness provides the runtime and tool surface for multi-LLM coordination —
an agent talks to a supervisor, a facilitator chairs a meeting, a lead drives
an asynchronous discussion — plus a CLI suite that runs evals, queries the
traces they produce, and edits skill files under controlled conditions.
CLIs
| CLI | Purpose |
|---|---|
fit-harness |
Run agents in run/supervise/facilitate/discuss subcommands. |
fit-trace |
Download, query, and analyze NDJSON traces produced by fit-harness. |
fit-benchmark |
Run task families for N runs each and aggregate pass@k. |
fit-selfedit |
Write stdin to .claude/** paths, gated by settings.json + branch. |
fit-harness's subcommands share one orchestration loop and one async tool
surface, below. The judge role is a profile passed to supervise.
Modes
| Mode | Lead | Participants | Terminal tool |
|---|---|---|---|
run |
(none) | one agent | task completion |
supervise |
supervisor |
one agent |
Conclude |
facilitate |
facilitator |
N named | Conclude |
discuss |
lead |
N named | Adjourn or Recess |
judge |
judge |
(none) | Conclude |
run and judge are one-shot. The other three share OrchestrationLoop
plus an async Ask/Answer/Announce/RollCall tool surface; the loop fans
messages out over an in-memory bus and emits a {source, seq, event}
NDJSON envelope for every line.
Async Ask / Answer / Announce
Ask({ question, to? }) → { askIds: [N, …] }
Answer({ message, askId? }) → routed to the asker
Announce({ message }) → broadcast, no reply expected
Every Ask returns immediately and registers a pending entry keyed by an
askId. The reply arrives later on the asker's inbox as [answer#N] <participant>: <text>. Broadcast: omit to on a multi-participant
lead. Answer's askId is optional — the handler is forgiving:
- Provided + matches an ask owed by the caller → routes to that asker.
- Provided but unknown or wrong addressee →
isErrorwith a pointed message. - Omitted + exactly one ask owed to the caller → auto-picks it.
- Omitted + 0 or many asks owed → broadcasts as Announce.
Inbox lines on resume:
[ask#42] facilitator: What is your current condition?
[answer#41] agent-1: We're at 7 out of 10.
[shared] agent-2: FYI I'm switching to Bun 1.2.
[system] @orchestrator: You have an unanswered ask from facilitator (askId=42)…
Async means the lead can issue Asks, end its turn, and plan in the gap while participants work in parallel — nothing blocks the LLM thread.
Discuss-mode replies
In discussion mode, Answer calls routed to the lead are streamed to
the discussion thread as they are produced — each agent's Answer becomes
a separate reply posted immediately, not batched at session end. The
lead and agents can also call Acknowledge to post brief messages
directly to the thread (status updates, human follow-up responses).
The message bus intercepts answers and appends them to ctx.replies[].
RequestForComment is a separate coordination tool available on agent
roles (facilitated agents and discuss agents). It queues an intent to
open a new Discussion thread for long-horizon coordination on open
questions; these are accumulated in ctx.rfcs[], separate from the
thread replies in ctx.replies[].
Orchestration loop
Each participant drains the bus (or waits), runs/resumes the LLM with
drained messages as tagged lines, and on an unanswered owed Ask injects
one synthetic reminder before emitting protocol_violation and
unblocking the asker with a synthetic null answer.
Termination uses two flags. ctx.concluded is explicit
Conclude/Adjourn/Recess — also cancels in-flight Asks so askers
see why their question won't be answered. stopped is broader: lead
error, agent crash, abort path. Loops watch stopped; ctx.concluded
only feeds the summary's success/verdict.
Tool surface, by role
| Role | Ask | Answer | Announce | RollCall | Conclude | Other |
|---|---|---|---|---|---|---|
| Facilitator | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Fac. agent | ✓ | ✓ | ✓ | ✓ | RequestForComment |
|
| Supervisor | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Sup. agent | ✓ | ✓ | ✓ | ✓ | ||
| Discuss lead | ✓ | ✓ | ✓ | ✓ | Recess, Adjourn, Acknowledge |
|
| Discuss agt | ✓ | ✓ | ✓ | ✓ | RequestForComment, Acknowledge |
|
| Judge | ✓ |
Ask's to accepts a participant name on multi-participant roles
(facilitator, discuss lead, all participants). The supervise pair has
only one possible target so to is rejected there.
Minimal example: two-participant facilitator
import { createFacilitator, createRedactor } from "@forwardimpact/libharness";
import { query } from "@anthropic-ai/claude-agent-sdk";
const facilitator = createFacilitator({
facilitatorCwd: process.cwd(),
agentConfigs: [
{ name: "alice", role: "explorer", agentProfile: "alice" },
{ name: "bob", role: "tester", agentProfile: "bob" },
],
query,
output: process.stdout,
redactor: createRedactor(),
facilitatorProfile: "improvement-coach",
});
const result = await facilitator.run("Run a kata storyboard meeting.");
// result.success / result.turns / NDJSON trace on process.stdoutThe facilitator gets Ask/Answer/Announce/RollCall/Conclude;
each agent gets the same minus Conclude. Every tool call, bus
message, and orchestrator event becomes one trace line.
Trace format and redaction
Each line is { "source": "<participant|orchestrator>", "seq": N, "event": {…} }. seq is monotonic across the whole trace; orchestrator emits
session_start, agent_start, protocol_violation, lead_turn_limit,
and summary. event is the SDK event verbatim or the orchestrator
payload. fit-trace consumes this format.
Redaction is on by default for fit-harness run/supervise/facilitate
and composes two layers:
- Env-var allowlist —
ANTHROPIC_API_KEY,GH_TOKEN,GITHUB_TOKENby default; override withLIBHARNESS_REDACTION_ENV_VARS=NAME1,…(replaces, not extends). Runtime values become[REDACTED:env:NAME]everywhere they appear. - Credential-shape patterns —
sk-ant-,ghp_,ghs_,gho_,github_pat_. Hits become[REDACTED:pattern:KIND].
Set LIBHARNESS_REDACTION_DISABLED=1 to disable (one stderr warning per
run). Never on CI for a public repo — workflow artifacts are
downloadable through retention.
Module map
| Module | Purpose |
|---|---|
agent-runner.js |
One Claude Agent SDK session; emits NDJSON via the redactor. |
message-bus.js |
Per-participant queues + waitForMessages Promise wakeup. |
orchestration-toolkit.js |
Shared Ask/Answer/Announce/Conclude/RollCall/RequestForComment handlers + builders. |
orchestration-loop.js |
Unified lead+participant loop; reminder/violation handling. |
facilitator.js / supervisor.js / discusser.js / judge.js |
Per-mode class + factory + system prompt. |
discuss-tools.js |
Discuss-only Recess/Adjourn/Acknowledge. |
reply-emitter.js |
Fire-and-forget POST of reply/ack events to the callback URL. |
inbox-poller.js |
Long-poll the bridge inbox for injected human messages. |
trace-collector.js / trace-query.js / trace-github.js |
Trace ingestion / querying / GitHub-attachment helpers. |
redaction.js |
Env-var allowlist + credential-shape pattern redaction. |
fit-selfedit
A narrow, audited bypass for sessions where Edit/Write (and bash
writes) are blocked against paths the project's own allowlist permits.
Reads stdin, writes the target, exits 0 / 2 (safeguard violation) / 1
(I/O error).
echo "<content>" | bunx fit-selfedit <path>Two safeguards, checked in order:
Settings-allow. Walk upward from the target with
Finder.findUpwardto find the nearest.claude/settings.json. The target relative to its grandparent directory must match at least oneEdit(<glob>)rule inpermissions.allow[](matched withminimatch,dot: true). Settings.json is the single source of truth — widen the project allowlist and the CLI follows. Traversal like.claude/../README.mdis rejected as a side effect:path.resolvecollapses..first, then the resolved path tests against the rules.Branch scope.
git rev-parse --abbrev-ref HEADmust not beHEAD(detached) ormain. Edits ride a feature branch through whatever merge gates the project has configured.
Failure messages name the safeguard that rejected; safeguard 1 also
lists the Edit() rules that were tried.
Documentation
- Agent Evaluations Guide — how to run an eval and read its trace.
- Agent Collaboration Guide — supervise / facilitate / discuss in depth.
- Trace Analysis Guide — analysing NDJSON traces with
fit-trace.