Evaluation infrastructure for the swarmkit ecosystem: (harness × model × task × arm × seed) agent evals with ground-truth scoring, cost-matched Pareto reporting, and scalable parallel execution.
Typed TypeScript client for the Jetty AI/ML workflow platform
Mathematical expression evaluator
WebAssembly-based Redis Lua 5.1 script engine for Node.js. Executes Redis-compatible Lua scripts without a live Redis server.
Agent evaluation framework — prove whether agent changes improved outcomes with reproducible evidence.
promptfoo extension for writing AI evaluations for Twilio AI Assistants
J-Rig, a seven-layer binary eval CLI for Claude Skills. The j-rig command covers package integrity, trigger/functional/regression/baseline scoring, an optimizer, and rollout-gate evidence. Self-contained (bundles the internal eval engine).
Skill Refiner orchestrator + I/O adapters + CLI: content-addressed store, j-rig score() shell-out adapter, tiered Anthropic propose() adapter, and the j-rig refine commands. Wraps the pure @intentsolutions/refiner-core.
Skill Refiner pure core: bounded-edit apply transform, deterministic synthetic eval-set bootstrap, the Pareto-dominant acceptance gate (DR-028 P0-RATIFY-1), and the swappable RefinerStrategy interface (AC-13).
Evaluation harness for Eidentic agents — scorers, LLM-as-judge, dataset management, CI pass-rate gate, and production trace promotion.
Memory benchmark harness for Eidentic — run LongMemEval / LoCoMo / temporal-reasoning benchmarks with deterministic recall metrics.
Multi-format rendering of synthetic evaluation data — validate fixtures before they enter the eval pipeline.
Evaluation harness for Alef agents — OTel span collection, RunMetrics, Pi-parsable reports
Test how an AI host routes real user intents to your MCP server's tools — catch silent mis-routing before your users do.
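
Two of the entries above mention Pareto-based decisions (cost-matched Pareto reporting, and a Pareto-dominant acceptance gate). As a rough illustration of the general idea only, here is a minimal sketch of Pareto dominance over two axes. The `Candidate` type, the `score` and `cost` fields, and all function names are hypothetical and are not taken from any of the listed packages, whose actual interfaces are not described here.

```typescript
// Hypothetical shape: every name here is illustrative, not from any package above.
interface Candidate {
  name: string;
  score: number; // higher is better
  cost: number; // lower is better
}

// a dominates b if a is no worse on both axes and strictly better on at least one.
function dominates(a: Candidate, b: Candidate): boolean {
  const noWorse = a.score >= b.score && a.cost <= b.cost;
  const strictlyBetter = a.score > b.score || a.cost < b.cost;
  return noWorse && strictlyBetter;
}

// Accept a proposed change only if it Pareto-dominates the current baseline.
function acceptanceGate(baseline: Candidate, proposal: Candidate): boolean {
  return dominates(proposal, baseline);
}

// Keep only the candidates that no other candidate dominates.
function paretoFront(candidates: Candidate[]): Candidate[] {
  return candidates.filter(
    (c) => !candidates.some((other) => dominates(other, c)),
  );
}

const baseline: Candidate = { name: "baseline", score: 0.7, cost: 1.0 };
const cheaper: Candidate = { name: "cheaper", score: 0.7, cost: 0.8 };
const tradeoff: Candidate = { name: "tradeoff", score: 0.75, cost: 1.2 };

console.log(acceptanceGate(baseline, cheaper)); // true: same score, lower cost
console.log(acceptanceGate(baseline, tradeoff)); // false: better score, higher cost
console.log(paretoFront([baseline, cheaper, tradeoff]).map((c) => c.name));
```

Under this definition a gate rejects changes that trade one axis against the other, which is stricter than comparing a single weighted score. A real tool may define its axes and tie-breaking differently.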