npm.io
0.37.0 • Published yesterdayCLI

smart-web-mcp

Licence
SEE LICENSE IN LICENSE
Version
0.37.0
Deps
10
Size
1.8 MB
Vulns
0
Weekly
7.7K

smart-web

Local MCP for agentic web retrieval.

smart-web bundles three tools in one stdio server:

  • smartfetch for direct URLs
  • smartsearch for discovery
  • smartcrawl for same-site multi-page traversal

It is designed for agents and MCP hosts, not for manual browser automation. The job is to return structured retrieval output first, then signal when a browser-task runtime would be a better next step.

If a page is login-gated, keep the login UI, session persistence, and capture flow in a companion runtime instead of moving that stateful behavior into the smart-web core.

If smart-web returns reproducible incorrect or misleading output, open a GitHub issue with the exact input, tool arguments, observed output, expected behavior, and version. Skip obvious transient network, auth, or rate-limit failures unless the smart-web classification itself looks wrong.

Highlights

  • one local MCP instead of separate search, fetch, and crawl servers
  • explicit routing: URL → smartfetch, query → smartsearch, known site → smartcrawl
  • up-front smartfetch acquisition lanes instead of broad direct → browser retry chains, with pipeline.acquire.attempt_map showing seed/direct/relay/archive/browser phases and skip/failure reasons
  • optional Jina Reader relay for weak generic/article fetches when you explicitly enable it, with JSON mode and alternate-link preservation
  • shared document extraction for article-like pages keeps browser primary content separate from raw shell HTML, preserving headings, links, blocks, and Markdown when requested
  • docs export manifests are versioned and include per-document SHA-256 hashes plus resource-handle manifests for stable local Markdown mirrors; resource-handle manifests include both exported documents and metadata handles for the summary/manifest files, and successful export structuredContent embeds the same handle manifest for host follow-up calls
  • budgeted smartfetch projection records explicit truncation metadata instead of scattering silent hardcoded caps; MCP calls default to the compact preset for token-sensitive agent hosts
  • weak generic pages can recover through JSON-LD or Next.js payload extraction and same-origin RSS/Atom feed discovery before falling back to a handoff
  • optional local adaptive scraper fallback can call a Scrapling-compatible scrapling command for AI-targeted extraction on weak JavaScript shells before spending a full Playwright browser pass
  • paywalled article pages prefer lawful archive recovery and search-indexed reference previews before telling the host to escalate elsewhere
  • LinkedIn authwall handling prefers compliant public fallbacks such as archives and search-indexed reference metadata before giving up
  • legal paper fallback for academic URLs: DOI-aware OpenAlex, Unpaywall, Semantic Scholar, CORE discovery, and Europe PMC enrichment plus bioRxiv/medRxiv API fallback when a direct paper page is thin or blocked
  • site-native public search fallbacks cover Reddit, GitHub repository discovery/repo-docs queries, npm search/exact packages, Hacker News, Stack Exchange, Wikipedia, Velog, MDN, and crates.io when a matching site: query is available
  • YouTube watch, shorts, embed, live, and youtu.be video URLs get best-effort public caption enrichment by default while non-video YouTube pages stay on the generic fetch path
  • structured output with assessment, pipeline, and partial-result research_handoff fields for host-side decision making
  • local-first defaults with privacy-aware search/fetch behavior
  • useful normalization for common public surfaces instead of raw HTML dumps
  • social unavailable shells such as Threads generic join/login pages stay partial and chrome-free instead of masquerading as verified post content

Tool routing

  • smartfetch first when the user already gave a URL or shortlink
  • smartsearch when the user gave a topic, keywords, or a site: query
  • smartcrawl when you already know the site and need multiple relevant pages
Situation Tool
User already gave a URL or shortlink smartfetch
User gave a topic, keywords, or site: query smartsearch
You already know the site and need multiple pages smartcrawl

smart-web is a retrieval layer. It is not a general-purpose click/type/form-fill browser agent.[^1]

What the tools return

All three tools are shaped for agent consumption. Common high-signal fields include:

  • assessment: confidence, block/auth hints, recommended handoff
  • pipeline: resolve/acquire/normalize/assess stages, plus policy metadata for relay-style usage and authwall likelihood; pipeline.acquire.attempt_map describes the default seed, direct/impit, Jina Reader, archive, and browser phases as succeeded, failed, or skipped
  • research_handoff: compact metadata on blocked or partial smartfetch results telling downstream research workers and editable report renderers which evidence fields to consume
  • budget: output projection metadata showing which fields were truncated and by how much; use smartfetch output_budget: "compact" for token-sensitive MCP hosts, preferred_budget_chars for host-budget-aware routing, and "full" when larger structured payloads are explicitly needed
  • evidence.selected_output_budget*: the selected smartfetch budget preset, winning control, and resolver reason so hosts can see which v0.x compatibility hint actually took effect
  • errors: structured failure reasons
  • post, thread, comments: normalized content surfaces
  • assets and download: file-like outputs when present

This lets hosts decide whether deterministic retrieval was good enough or whether they should escalate to an auth-aware companion flow or a broader browser-task runtime.[^2]

For smartfetch, MCP structuredContent always carries the projected JSON result. MCP calls default to output_budget: "compact" so agent hosts do not ingest large page/thread/link payloads unless they ask for them. The human-readable content[0].text defaults to compact text instead of duplicating that JSON payload, with separate caps for long body text, result, comment, link, and asset sections. Omitted counts are reported in the text view while the projected fields remain available through structuredContent; set format: "json" only when a host needs JSON repeated in the text channel. Set output_budget: "balanced" or "full" only for follow-up calls that explicitly need more structured content. For host-side integration, prefer preferred_budget_chars; use context_mode only as a backward-compatible coarse alias when numeric hints are not supplied. If neither is present, headroom_tokens keeps compact output by default and uses balanced only when there is enough room. format: "markdown" keeps the compact response shape and requests Markdown extraction when available.

When a smartfetch result is blocked, partial, or low-confidence, structuredContent.research_handoff stays small and metadata-only. It points downstream tools at the source URL and the projected evidence bundle: pipeline, assessment, errors, assets, outbound_links, and the full projected structuredContent. Heavy research agents or editable report/rendering tools can consume that bundle downstream, but smart-web itself remains the unified retrieval kernel and does not spawn those tools.

For smartsearch, MCP calls now project both structuredContent and rendered text through the same compact/balanced/full budget controls. Compact search keeps enough ranked links for first-pass discovery while capping long snippets, notes, and search-to-crawl handoff items; contextMaxCharacters remains a final text-channel cap for legacy hosts. Use output_budget: "full" only for a known follow-up when the host really wants every returned snippet/result in structuredContent.

For smartcrawl, MCP summary calls now project both structuredContent and rendered text through compact/balanced/full budget controls, so broad docs/forum crawls do not flood agent context with every page, candidate, note, or error by default. output_dir export mode remains handle-first: successful export responses embed resource_handle_manifest in structuredContent, and export summaries include file:// re-open guidance for the summary, manifest, resource-handle JSON, and first Markdown document so agent hosts can resume from handles without parsing the full JSON first. The resource-handle manifest also carries ordered recommended_resources for hosts that need a ready-to-open summary, manifest, and first document without inferring role priority. Blocked exports label those paths as planned/unwritten and include retry guidance instead of presenting false file handles. Set format: "json" only for hosts that cannot consume structuredContent and intentionally need JSON repeated in the text channel.

Use budget presets this way:

  • compact: default first pass for agent hosts, link triage, social/thread previews, and "is this page enough?" checks.
  • balanced: follow-up when the first pass found the right page but omitted useful body, result, comment, or link detail.
  • full: explicit retrieval pass for a known relevant URL when the host can afford the larger structuredContent payload.

When compact output omits needed content, make a follow-up call for the same URL/query/start site with a larger budget control instead of asking the model to infer from the compact text. Prefer preferred_budget_chars when the host knows its available response budget, or use output_budget: "balanced" / "full" for manual retries.

Supported high-signal surfaces

  • communities: Reddit, DCInside, LinkedIn public posts with public fallback recovery, Algumon
  • search-native surfaces: Reddit, GitHub repository discovery and repo-docs path scopes, npm search and exact package path scopes, Hacker News, Stack Exchange, Wikipedia, Velog, MDN, crates.io
  • competitive-programming surfaces: solved.ac problem/search/profile routes, BOJ problem/workbook/user pages, Codeforces problem/profile pages, AtCoder task/contest pages, QOJ problem pages, Jungol problem pages
  • local map place pages: Naver Map, Kakao Map, and redirected naver.me place-share links
  • reference/article pages: NamuWiki, Wikipedia, Naver Blog, Tistory, Velog
  • media/social: YouTube video URLs with best-effort public transcripts, X, Threads, Instagram, Telegram
  • commerce pages: Amazon, Coupang, Danawa, Aladin, AliExpress
  • archives/reference wrappers: Wayback snapshots, archive.md lookup pages, arXiv abstract pages with direct paper links
  • academic paper surfaces: arXiv, PubMed, PMC, bioRxiv, medRxiv, plus legal DOI-linked OA copies surfaced from OpenAlex, Unpaywall, Semantic Scholar, CORE, and Europe PMC when a publisher page is thin or paywalled

When a source stays blocked or under-specified, smart-web prefers partial but honest output over pretending to have verified content.

Install

npx -y smart-web-mcp

When smartfetch or smartcrawl needs Playwright-backed page loading, smart-web checks whether the bundled Chromium revision is available. If the local Playwright browser cache is missing or outdated, smart-web warns at startup and will try npx playwright install chromium automatically on first browser use before surfacing a structured error.

Host setup

Claude Code
claude mcp add smart-web -- npx -y smart-web-mcp

Suggested tool-use profile: default to smartfetch without format and without a budget control for compact first-pass retrieval; retry with output_budget: "balanced" or "full" only when the user asks for deeper extraction. If Claude Code exposes a host response budget, send it as preferred_budget_chars.

Codex
codex mcp add smart-web -- npx -y smart-web-mcp

Or via ~/.codex/config.toml (or project-scoped .codex/config.toml):

[mcp_servers.smart-web]
command = "npx"
args = ["-y", "smart-web-mcp"]

Suggested tool-use profile: keep default compact text for normal browsing. For targeted follow-up retrieval, pass preferred_budget_chars from the remaining response/tool budget when available; otherwise pass one explicit output_budget preset.

OpenCode
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "smart-web": {
      "type": "local",
      "command": ["npx", "-y", "smart-web-mcp"],
      "enabled": true
    }
  }
}

Suggested tool-use profile: call smartfetch with no format for normal retrieval, format: "markdown" only when Markdown extraction is useful, and format: "json" only for hosts that intentionally require the full JSON duplicated into content[0].text.

Hermes/Raon

For Hermes/Raon-style hosts, keep the MCP server command standard and set budget controls per tool call:

{
  "tool": "smartfetch",
  "arguments": {
    "url": "https://example.com/post",
    "preferred_budget_chars": 40000
  }
}

Use preferred_budget_chars as the primary host-profile control when Hermes/Raon knows the response budget. Use headroom_tokens only when the host has token headroom but no character budget, and use context_mode only for older integrations that cannot send numeric budget metadata.

Budget-control precedence is deterministic:

  1. output_budget
  2. preferred_budget_chars
  3. headroom_tokens
  4. context_mode
  5. compact default

Send only one control when possible. If older host profiles send more than one, evidence.selected_output_budget_source reports which control won.

Custom settings file

To use a non-default settings path, append --settings-file to the command:

npx -y smart-web-mcp --settings-file /absolute/path/to/smart-web.settings.json

For Claude Code:

claude mcp add smart-web -- npx -y smart-web-mcp --settings-file /absolute/path/to/smart-web.settings.json

For OpenCode, add "args" after "command":

{
  "mcp": {
    "smart-web": {
      "type": "local",
      "command": ["npx", "-y", "smart-web-mcp", "--settings-file", "/absolute/path/to/smart-web.settings.json"],
      "enabled": true
    }
  }
}

Configuration

Settings path

Default: ~/.config/smart-web/settings.json

Print a template:

npx -y smart-web-mcp --print-settings-example

Initialize the default file:

npx -y smart-web-mcp --init-settings

smart-web treats the settings file as the single runtime config surface.

Full settings reference
{
  // "balanced" (default) or "private"
  // "private" disables relay-style providers and public search helpers
  "profile": "balanced",

  "runtime": {
    // Optional override for staging/export temp files
    // Default: platform cache root (e.g. ~/.cache/smart-web/tmp)
    "tempDir": ""
  },

  "search": {
    // Self-hosted SearXNG instance URL
    "searxngBaseUrl": "",
    "enableSearxng": true
  },

  "fetch": {
    // Browser overrides — most users leave these empty
    "chromeChannel": "",
    "chromePath": "",
    // Auto-install Playwright Chromium when missing (default: true)
    "autoInstallPlaywright": true,
    // Jina Reader relay — opt-in third-party path for weak article pages
    "enableJinaReader": false,
    "jinaReaderBaseUrl": "https://r.jina.ai/",
    // Academic fallback — legal OA enrichment for paper URLs
    "enableAcademicFallback": true,
    "enableOpenAlex": true,
    "enableEuropePmc": true,
    "enableBiorxivApi": true,
    // Unpaywall — requires a contact email
    "enableUnpaywall": true,
    "unpaywallEmail": "",
    // Semantic Scholar — optional API key for higher-rate access
    "enableSemanticScholar": true,
    "semanticScholarApiKey": "",
    // CORE — optional API key for richer search-backed enrichment
    "enableCoreDiscovery": true,
    "coreApiKey": "",
    // FxTwitter — transparent x.com → fxtwitter redirect
    "enableFxTwitter": true,
    // Undetected-chromedriver — optional Python Selenium fallback for tough anti-bot pages
    "enableUndetectedChromedriver": true,
    "undetectedChromedriverPython": "python3",
    // Site-specific fetches
    "enableRedditJson": true,
    // YouTube video transcript enrichment for public captions (default: true)
    "enableYoutubeTranscript": true,
    // Archive fallback — Wayback and archive.md recovery
    "enableArchiveFallback": true,
    "enableWayback": true,
    "enableArchiveMd": true,
    // Optional local adaptive scraper fallback. This is disabled by default;
    // install Scrapling separately and opt in only on deployments that want an
    // extra no-secret local shell/bot-challenge extraction attempt before the
    // built-in Playwright fallback.
    "enableAdaptiveScraper": false,
    "adaptiveScraperCommand": "scrapling",
    "adaptiveScraperTimeoutMs": 20000,
    // Output projection budget for smartfetch responses. MCP calls use compact
    // by default unless the caller passes a budget control explicitly; this setting
    // controls core/CLI defaults and custom deployments.
    // preset: "compact" | "balanced" | "full"; numeric fields override the preset.
    "outputBudget": {
      "preset": "balanced",
      "maxPostTextChars": 24000,
      "maxPostMarkdownChars": 32000,
      "maxThreadItems": 40,
      "maxCommentItems": 40,
      "maxOutboundLinks": 120,
      "maxAssets": 30,
      "maxBlocks": 60,
      "maxBlockTextChars": 1000,
      "maxItemTextChars": 1200
    },
    // Optional default named browser profile for authenticated smartfetch calls.
    // Names are 1-64 characters: letters, numbers, dots, underscores, or hyphens;
    // they must start with a letter or number and cannot end with a dot.
    "browserProfile": ""
  },

  "network": {
    // Allow localhost/private/reserved fetch targets (default: false)
    // Applies to smartfetch, smartcrawl, and Korean public API helpers.
    "allowPrivateHosts": false
  }
}
Common setups
Default — no config file needed

Out of the box smart-web works with sensible defaults. Create a settings file only when you need to change something.

Minimal: profile only
{
  "profile": "balanced"
}
Private / local-first
{
  "profile": "private",
  "search": {
    "searxngBaseUrl": "http://localhost:8080"
  }
}

In private mode, relay-style provider requests are blocked, public no-key search fallbacks are disabled, and SearXNG bases must resolve to localhost or private addresses. Hosted API-key search providers such as Exa, Tavily, and Brave Search are intentionally not part of the default smartsearch runtime; use local/site-native search, DuckDuckGo/Brave HTML fallbacks, or self-hosted SearXNG instead.

Disable Playwright auto-install
{
  "fetch": {
    "autoInstallPlaywright": false
  }
}

The server will return an actionable playwright_browsers_missing or playwright_browsers_outdated error instead.

Opt in to Jina Reader for weak generic article pages
{
  "fetch": {
    "enableJinaReader": true
  }
}

This stays opt-in because it is a third-party relay path. When enabled, it only applies to weak generic/article direct fetches and prefers Jina JSON mode when that improves the page. profile: "private" hard-disables relay-style providers.

Relay eligibility is provider-policy based. Generic and document-like providers can opt in, while browser/auth/social/commerce providers stay off unless their provider declares support. pipeline.policy.relay_style_allowed and pipeline.policy.relay_style_used tell hosts whether a relay path was allowed and whether it was used.

Opt in to a local adaptive scraper command
{
  "fetch": {
    "enableAdaptiveScraper": true,
    "adaptiveScraperCommand": "scrapling",
    "adaptiveScraperTimeoutMs": 20000
  }
}

This lane is for local, no-secret deployments that install Scrapling or a compatible CLI themselves. It runs only after direct retrieval still looks weak, blocked, or shell-like and before the built-in browser fallback. The attempt is recorded as adaptive_scraper in pipeline.acquire.attempt_map; if the command is missing, times out, returns empty text, or does not improve the active result, smartfetch keeps the existing result and continues through the normal fallback chain.

Tune smartfetch output budgets
{
  "fetch": {
    "outputBudget": {
      "maxPostTextChars": 16000,
      "maxThreadItems": 25,
      "maxOutboundLinks": 80
    }
  }
}

Extraction runs before budgeting so providers can work from complete page content. The budget applies to returned smartfetch projections and records truncation under budget.fields.

Force archive fallback off
{
  "fetch": {
    "enableArchiveFallback": false
  }
}
{
  "fetch": {
    "enableUnpaywall": true,
    "unpaywallEmail": "research@example.com",
    "enableSemanticScholar": true,
    "enableCoreDiscovery": true
  }
}
Enable undetected-chromedriver from a dedicated virtualenv
{
  "fetch": {
    "enableUndetectedChromedriver": true,
    "undetectedChromedriverPython": "/absolute/path/to/venv/bin/python"
  }
}

This helper is optional. smart-web still works without it, but when installed and reachable it can act as a second browser engine for stubborn pages where Playwright alone is not enough.

Custom temp directory
{
  "runtime": {
    "tempDir": "/absolute/path/to/smart-web-tmp"
  }
}
Advanced overrides

These keys live inside settings.json when you need to force one provider on or off:

Search

  • search.enableSearxng
  • search.enableDuckDuckGo
  • search.enableBraveHtml

Fetch

  • fetch.enableFxTwitter
  • fetch.enableXOembed
  • fetch.autoInstallPlaywright
  • fetch.enableJinaReader
  • fetch.jinaReaderBaseUrl
  • fetch.enableAcademicFallback
  • fetch.enableOpenAlex
  • fetch.enableEuropePmc
  • fetch.enableBiorxivApi
  • fetch.enableUnpaywall
  • fetch.unpaywallEmail
  • fetch.enableSemanticScholar
  • fetch.semanticScholarApiKey
  • fetch.enableCoreDiscovery
  • fetch.coreApiKey
  • fetch.enableUndetectedChromedriver
  • fetch.undetectedChromedriverPython
  • fetch.enableRedditJson
  • fetch.enableYoutubeTranscript
  • fetch.enableArchiveFallback
  • fetch.enableWayback
  • fetch.enableArchiveMd
  • fetch.outputBudget.maxPostTextChars
  • fetch.outputBudget.maxPostMarkdownChars
  • fetch.outputBudget.maxThreadItems
  • fetch.outputBudget.maxCommentItems
  • fetch.outputBudget.maxOutboundLinks
  • fetch.outputBudget.maxAssets
  • fetch.outputBudget.maxBlocks
  • fetch.outputBudget.maxBlockTextChars
  • fetch.outputBudget.maxItemTextChars

Compatibility

  • network.localOnly: advanced override that forces local-only behavior regardless of profile

Provider benchmark

Use the provider benchmark when deciding whether to build, wrap, or replace retrieval lanes:

npm run benchmark:providers
npm run benchmark:providers:measure
npm run benchmark:providers:compare
npm run --silent benchmark:providers:json
npm run benchmark:providers:json-file
npm run benchmark:providers:report
npm run benchmark:providers:gaps
npm run benchmark:providers:live
npm run benchmark:providers:live:jina
npm run benchmark:providers:live:cp
npm run benchmark:providers:live:browser
npm run benchmark:providers:live:youtube
npm run benchmark:providers:live:blocked
npm run benchmark:providers:live:naver-blog
npm run benchmark:providers:live:naver-map
npm run benchmark:providers:live:naver-cafe
npm run benchmark:providers:live:x-twitter
npm run benchmark:providers:live:no-secret
npm run benchmark:providers:live:no-secret:gaps
npm run watchdog:no-secret:live
npm run benchmark:providers:trend
npm run benchmark:smartsearch
npm run --silent benchmark:smartsearch:json
npm run watchdog:no-secret
npm run watchdog:partial-handoff

The default report is availability-only and safe for CI. --run-measurements (npm run benchmark:providers:measure) adds deterministic fixture records for available local/core and optional no-key lanes: success, useful character counts, projected/text character counts, useful-to-projected and useful-to-text yield ratios, latency, extractor/provider id, confidence, partial status, error class, lane label/family/category metadata, provider-family/category lane-status rollups, surface coverage, and recommendation sections grouped by lane, case, and surface.

Use --compare-external (npm run benchmark:providers:compare) to add an externalComparisons section that compares current smart-web fixture evidence against optional no-secret/local candidates. The comparison includes the local Crawl4AI command lane, the public Jina Reader relay fixture, and available direct/Playwright baselines. If crawl4ai is not installed, the Crawl4AI row is skipped with action: "install-local-prerequisite" and prerequisite.command: "crawl4ai"; this is an expected setup signal, not a benchmark failure or required dependency.

Reports carry the provider-benchmark.v1 schema version plus non-fatal git and runtime metadata so archived JSON, Markdown, and NDJSON trend artifacts are parseable, traceable, and comparable as the harness evolves. Markdown, text, JSON, and NDJSON outputs include aggregate fixture yield averages, compact yield rollups by provider family/category/lane, and a best-current-evidence surface table that ranks measured/skipped candidates so review jobs can compare token efficiency and build-vs-wrap posture between core, browser, relay, and specialist paths without reprocessing the full measurement table.

Measurement artifacts also include a live-readiness manifest listing each lane's env/command/settings prerequisites, a deterministic fixture smoke, a safe MCP live-smoke candidate, and surfaces that still lack live evidence. --run-live-smokes (npm run benchmark:providers:live) is an explicit gated mode that runs only no-secret, core-surface live checks for static articles, long docs/Wikipedia, Hacker News, and docs export; optional no-secret lanes remain gated until explicitly allowlisted: --allow-jina-live-smoke (npm run benchmark:providers:live:jina) exercises the public Jina Reader relay with a secret-stripped settings file, --allow-competitive-programming-live-smoke (npm run benchmark:providers:live:cp) adds a public BOJ/Codeforces representative smoke for the competitive-programming specialist surface, --allow-browser-live-smoke (npm run benchmark:providers:live:browser) runs a Playwright --force-dynamic smoke for the JS-rendered article surface only when the local browser lane is available, --allow-youtube-live-smoke (npm run benchmark:providers:live:youtube) adds the public YouTube seed-lane representative smoke, --allow-blocked-authwall-live-smoke (npm run benchmark:providers:live:blocked) adds a deterministic HTTP Basic Auth challenge smoke that treats honest blocked/error/handoff signals as positive calibration without private credentials, --allow-naver-blog-live-smoke (npm run benchmark:providers:live:naver-blog) adds a stable public Naver Blog representative smoke for Korean specialist-lane calibration, --allow-naver-map-live-smoke (npm run benchmark:providers:live:naver-map) adds a stable public Naver Map place smoke with browser-preflight diagnostics, --allow-naver-cafe-live-smoke (npm run benchmark:providers:live:naver-cafe) adds a stable public Naver Cafe shared-link smoke that treats useful no-credential partial output as honest calibration, and --allow-x-twitter-live-smoke (npm run benchmark:providers:live:x-twitter) adds a public X/Twitter post smoke that keeps no-secret social-provider drift visible without private cookies. --allow-all-no-secret-live-smokes (npm run benchmark:providers:live:no-secret) enables every no-secret gate in one run; hosted API-key search lanes are excluded from the benchmark surface by product policy.

Live smoke artifacts record per-case present/missing/skipped evidence next to fixture evidence without printing secret values. When fixture rows exist but all are skipped by missing local commands, comparison and gap artifacts report fixture evidence as skipped with a fixture-skipped-* comparison instead of implying the fixture rows are absent. Browser-lane live smoke rows also include a secret-safe Playwright preflight diagnostic with executable path, launch timing, domcontentloaded navigation timing, HTTP status/final URL, and truncated failure reason so runtime/cache/network failures can be separated from smartfetch normalization failures.

The same JSON, text, Markdown, and NDJSON outputs include a compact live-vs-fixture comparison rollup so replacement reviews can distinguish calibrated surfaces from fixture-only, skipped-live, live-disagreeing, and fixture-skipped-by-prerequisite surfaces. Secret-safe provider diagnostics list prerequisite names with satisfied/missing status and never print env values. Use npm run benchmark:providers:json-file or add --json-file <path> to write the full JSON artifact without relying on stdout capture. Use npm run benchmark:providers:report or add --report-file <path> with measurement mode to write a Markdown review artifact containing lane availability, diagnostics, surface/corpus coverage, measurement highlights, yield comparisons, per-surface evidence, live readiness, live-smoke diagnostics, and build-vs-wrap recommendations without changing stdout or JSON behavior. Use npm run benchmark:providers:gaps or add --gap-file <path> with measurement mode to write a compact provider-benchmark-gaps.v1 JSON artifact containing only the ranked provider evidence gap queue, summary counts, git/runtime metadata, and safe smoke candidates. Use npm run benchmark:providers:trend or add --trend-file <path> with measurement mode to append compact NDJSON trend records with aggregate lane, prerequisite, measurement, recommendation, surface-evidence, and fixture-yield counts. Unknown --flag options fail fast so typoed benchmark jobs do not produce misleading artifacts. The corpus covers static articles, JS-rendered articles, blocked/authwall-like pages, long docs/Wikipedia pages, Naver Blog/Cafe/Map, Hacker News, YouTube, X/Twitter, BOJ/Codeforces, relay, and docs export surfaces. Optional local/no-secret lanes without commands remain explicit skips instead of silently failing; API-key provider candidates are excluded from the default corpus and reports by product policy.

Provider benchmark reports also include providerEvidenceGaps: a ranked next-action queue built from the live-vs-fixture rollup. Each gap records the surface, priority score, selected and candidate lanes, follow-up action, unlock prerequisites, fixture/live useful chars, fixture skipped-case lanes/reasons, live-smoke failure error classes/reasons, and safe live smoke candidate so replacement reviews can pick the next provider experiment without scanning every measurement row or re-running a timed live calibration. Use npm run benchmark:providers:live:no-secret:gaps to refresh only the compact all-no-secret gap artifact at reports/provider-benchmark-live-no-secret-gaps.json. When fixture rows exist but were skipped by missing local commands, the live-vs-fixture and gap actions are unlock-skipped-fixture-prerequisites rather than fixture-creation actions.

npm run benchmark:smartsearch is the deterministic quality gate for site-native smartsearch routing. It patches fetch, disables generic web-search fallbacks with a temporary settings file, and exercises representative site: queries for Reddit, GitHub repo discovery, GitHub repo docs, npm, Hacker News, Stack Exchange, Wikipedia, Velog, MDN, and crates.io without live network access. The human report is compact; npm run --silent benchmark:smartsearch:json emits smartsearch-quality-benchmark.v1 JSON with expected engine, top URL, URL-substring checks, actual engine, top URL, notes, and pass/fail status per case.

See reports/README.md for the curated report archive index. It identifies provider-benchmark-live-no-secret.{json,md} and provider-benchmark-live-no-secret-gaps.json as the current maintenance artifacts, and labels older cumulative live-smoke reports as historical calibration snapshots.

npm run watchdog:no-secret is the deterministic current-surface gate for no-secret maintenance. It fails when reports/provider-benchmark-live-no-secret-gaps.json contains current gaps, when runtime/docs/examples/scripts reintroduce hosted API-key provider markers for Exa, Tavily, Firecrawl, Browserless, or Brave Search API, or when the partial/handoff watchdog no longer has a compact text ceiling. It intentionally ignores historical CHANGELOG.md entries and archived reports/ output, and does not run live network smokes.

npm run watchdog:no-secret:live is the live regression gate for no-secret maintenance. It first refreshes reports/provider-benchmark-live-no-secret-gaps.json with npm run benchmark:providers:live:no-secret:gaps, then runs npm run watchdog:no-secret so non-empty live provider gaps fail locally or in CI. The scheduled/manual GitHub Actions workflow .github/workflows/no-secret-live-watchdog.yml runs the same alias weekly with read-only permissions, no secrets, and a short-retention gap artifact.

npm run watchdog:partial-handoff builds the package and runs a compact no-secret drift watchdog over representative partial/blocked/handoff smartfetch outputs (httpbin authwall, LinkedIn profile authwall, Naver Blog unavailable/deleted posts, Threads unavailable/join-login output, DCInside unavailable/deleted posts, Telegram missing/private posts, a non-existent Wikidocs article, Naver Cafe, Naver Map search, and BOJ unavailable/shutdown output). It exits non-zero when expected handoff cases are silently reclassified as non-partial, when partial/handoff content[0].text grows beyond the compact watchdog ceiling, or when compact text or structured links/assets leak chrome URLs, script globals, or style-shell tokens, and prints a partial-handoff-watchdog.v1 JSON report without raw page dumps or credentials. For runner artifacts without JSON stdout, run node scripts/partial-handoff-watchdog.mjs --report-file <path> after npm run build; the command keeps stdout human-readable and writes the full JSON report to the requested path. Fixture cases can set expectPartialHandoff: true to make classifier drift fail even if the output would otherwise be skipped.

Quick verification

Sanity-check the server with a few real calls:

  • smartfetch on a normal article URL
  • smartfetch on a Medium or other member-only article URL — confirm it returns either archive-backed content or an honest reference_only preview
  • smartfetch on a Naver Map, Kakao Map, or naver.me place-share URL
  • smartfetch on an arXiv abstract URL
  • smartfetch on a PubMed, PMC, or bioRxiv/medRxiv paper URL
  • smartfetch on a known product URL
  • smartsearch on a site: query
  • smartcrawl on a docs site or board you actually use

Reporting issues

If smart-web returns reproducible incorrect or misleading output, open a GitHub issue with:

  • exact input and tool arguments
  • observed output
  • expected behavior
  • smart-web-mcp version

Skip obvious transient network, auth, or rate-limit failures unless the classification itself looks wrong.

License

All rights reserved. See LICENSE for terms.

This software is licensed under a proprietary license that permits installation and use as a local MCP server but prohibits modification, redistribution, reverse engineering, or use in competing products.

References

[^1]: Model Context Protocol, "Tools" specification — MCP tools are a retrieval and integration surface, not a requirement to emulate full browser-automation flows inside one tool.

[^2]: Model Context Protocol Blog, "Tool Annotations as Risk Vocabulary: What Hints Can and Can't Do" (2026-03-16) — structured tool output and accurate hints improve host-side routing and escalation decisions.

Keywords