Keyword: evals

eve
Released: 12h ago
Version: 0.17.0
Filesystem-first framework for durable backend AI agents that run anywhere.
agent-framework agents ai-agents ai-sdk evals eve +9
@mcpjam/sdk
Released: 3d ago
Version: 1.25.0
MCP server unit testing, end to end (e2e) testing, and server evals
mcp model-context-protocol testing evals unit-testing e2e
treetrace
Released: 3d ago
Version: 0.10.4
Catch every time your AI coding agent touches auth, secrets, or skips a test, then turn the correction you made into a local regression eval. Local-first, deterministic, no LLM judge.
security regression evals claude-code ai-agent agent +10
my-pi
Released: 17h ago
Version: 0.1.95
Composable pi coding agent with MCP, LSP, prompt presets, and local eval telemetry
cli coding-agent evals lsp mcp pi +2
@botpress/evals
Released: 3d ago
Version: 2.0.1
Evaluation definitions and runner for ADK-based Botpress agents
adk agent ai botpress evals evaluation
@roleplay-sh/cli
Released: 6d ago
Version: 0.1.12
Included local runner for roleplay.sh social-engineering tests.
ai agents security social-engineering evals testing +2
@spences10/pi-context
Released: 1 week ago
Version: 0.1.2
Searchable local SQLite sidecar that keeps oversized Pi tool output useful without flooding model context
context evals fts5 pi pi-package sqlite
@plasius/ai-evals
Released: yesterday
Version: 0.1.9
Golden datasets, scorecards, and cost-quality evaluation contracts for Plasius AI workloads.
ai evals scorecards quality cost plasius
@asymmetric-ai/cli
Released: yesterday
Version: 0.2.0
Spin up high-fidelity local SaaS clones for AI agents and evals.
ai agents evals saas clone sandbox +3
@themobiusstrip/coble
Released: 6d ago
Version: 0.5.1
A local, provider-agnostic agent CLI — LangGraph.js core, Ink TUI, durable sessions, human-in-the-loop approvals, built-in evals.
agent cli langgraph ink llm evals
@percepta/kaizen
Released: 4d ago
Version: 0.10.0
Automated AI researcher that improves AI systems
ai cli evals kaizen langfuse
@syntaxname/verdictci
Released: 51m ago
Version: 0.1.1
Fail pull requests when AI behavior regresses.
ai llm evals ci github-actions rag +3
@cyanheads/evals-mcp-server
Released: yesterday
Version: 0.1.2
Author verifiable eval records through a draft → review → revise → submit loop with server-enforced graders; compile to JSONL/CSV/Inspect/lm-eval via MCP. STDIO or Streamable HTTP.
mcp mcp-server model-context-protocol typescript bun stdio +9
@hasna/evals
Released: yesterday
Version: 0.1.30
Open source AI evaluation framework — LLM-as-judge + assertion-based evals for any AI app. CLI + MCP server.
evals llm ai testing evaluation mcp +4
lastlight-evals
Released: 5h ago
Version: 0.2.0
Eval harness for Last Light workflows — drives the real production workflows against a mocked GitHub and grades deterministically (SWE-bench compatible).
lastlight evals swe-bench agent workflow
@memoturn/sdk
Released: 2d ago
Version: 0.1.0
memoturn JS/TS SDK — tracing, OpenAI wrapper, LangChain callback, prompt fetch.
memoturn llm observability tracing openai langchain +2
@nearform/tracebound
Released: 3d ago
Version: 0.1.0
Tracebound CLI: deterministic primitives for the Tracebound agent-improvement loop.
cli llm agent traces observability evals +1
@verica-app/cli
Released: 3d ago
Version: 0.1.7
Run a Verica eval from CI and block the merge on the result.
verica evals llm ci prompt testing
@use-lightcurve/sdk
Released: 3d ago
Version: 0.2.2
Lightcurve human-grounded voice dataset API client and runtime instrumenter.
lightcurve voice-ai evals testing datasets pipecat
@eigenpal/cli
Released: 4d ago
Version: 0.10.4
Eigenpal CLI — author and run document workflows from your terminal. Agent-ready.
eigenpal cli ai agent workflow llm +5
@mastra/core
Released: 39m ago
Version: 1.47.0
ai llm llms agent agents vectorstore +8