swe-bench

swe-bench
Released
3 months ago
Version
0.1.0
@metaharness/darwin
Released
5d ago
Version
0.7.0
Freeze the model, evolve the harness. Two measured applications: (1) SWE-bench Lite code repair: 7.7% open-loop to 58.3% via cheap-to-frontier tiering (official swebench Docker, verified), ~$0.01-$0.74/instance vs. $1-20 for frontier agents; (2) Darwin Shie
llm cost-optimization llm-optimizer cheap-llm compute-arbitrage agent-harness +6
@axplusb/kepler
Released
5d ago
Version
2.0.6
Kepler — AI coding agent with operating brief, preflight planning, and sub-agents. SWE-bench Lite evaluated.
kepler codekepler ai coding-agent swe-bench cli +6
oh-my-githubcopilot
Released
2 weeks ago
Version
2.0.0
Multi-agent orchestration for GitHub Copilot CLI. 19 agents, 59 skills, parallel execution, HUD, PSM, SWE-bench.
orchestration multi-agent autopilot swe-bench hud copilot-cli
lastlight-evals
Released
5h ago
Version
0.2.0
Eval harness for Last Light workflows — drives the real production workflows against a mocked GitHub and grades deterministically (SWE-bench compatible).
lastlight evals swe-bench agent workflow
@metaharness/weight-eft
Released
yesterday
Version
0.1.1
Fine-tune cheap open-source LLMs (GLM, Qwen, DeepSeek) on your AI coding agent's successful runs with LoRA (SFT + DPO) so your model cascade escalates to expensive frontier models (GPT, Claude) less often — cutting cost-per-resolved. Turns run history int
llm lora fine-tuning peft sft dpo +14
vexp-cli
Released
20h ago
Version
2.1.0
Vexp — Context Engine for AI Coding Agents. Pre-indexes your codebase into a dependency graph and delivers ranked context to any MCP-compatible agent. 58% lower cost per task, 90% fewer tool calls (SWE-bench Verified). Works with Claude Code, Cursor, Copi
vexp ai context-engine mcp coding-agent cursor +14
@wrongstack/bench
Released
20h ago
Version
0.275.1
Model-independent agentic benchmark harness for WrongStack (Aider polyglot + SWE-bench Verified) with deterministic graders and harness fingerprinting.