Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.
AI red-teaming for the AI agents & LLM apps you own: stress-test them with adversarial models to find security failures (prompt injection, tool misuse / excessive agency, data leakage, jailbreaks, denial-of-wallet), auto-patch (blue team), retest, and get