turbo-surf
Native-speed, browserless web crawler + MCP server for AI agents — built on turbo-dom. Fetch + parse + extract
- run page JS with no headless browser; 100×+ faster on server-rendered pages.
turbo-surf is a single native (Rust) engine — no Chromium, no pixels, no layout:
- A crawler — point it at a domain and stream page records: indexed interactive elements, a link/form graph, an accessibility tree, markdown and plain-text views, rendered-HTML capture, CSS/XPath node queries, and schema-driven structured extraction.
- An agent tool — a 60-tool MCP server agents drive directly over stdio
(
crawl,batch, navigate, click/fill/submit, query, extract, accessibility tree, markdown,render/eval_js/inject_js, cookies/headers,snapshot).
For pages that need JavaScript it runs their scripts — either by mining server-embedded hydration state or by executing page JS inside a true V8 isolate (no browser) and re-rendering the DOM. Its page API is Playwright-shaped: the benchmark suite drives the engine with unmodified Playwright scripts.
What makes it different
Most tools in this space pick one lane: a crawler or a browser-automation library, and they get their DOM from a real browser (Playwright/Puppeteer/ Selenium) or an in-process fake DOM with no security isolation (jsdom, happy-dom). turbo-surf is unusual on four axes at once:
- AI-agent-ready out of the box. It ships a full MCP server (60 tools:
navigate, click/fill/submit, query, extract, accessibility tree, markdown,
crawla whole site,batcha URL list,render/set_modeto run page JS,eval_js/inject_jsagainst the live render heap with a DOM-history trail, cookies/headers,snapshot, …) so an agent drives real pages over stdio with no browser and no glue code —npx turbo-surf-mcp. Most crawlers are libraries you wrap yourself; this one is an agent tool on day one. - Crawler + agent surface on one native engine. The same engine bulk-crawls a domain and serves the MCP tools — no browser anywhere in the stack. Its page API is Playwright-shaped (the benchmark harness runs unmodified Playwright routines on it).
- Its own DOM, not a browser's. turbo-dom is a native + WASM HTML parser with a lazy copy-on-write DOM — native-speed parse, no pixels/layout/IPC.
- A V8 isolate to run page JS + re-render. Page (or your own) JavaScript runs
inside a real V8 isolate (a
deno_coreruntime — host heap unreachable from the guest, with a runaway-execution budget) against the native rtdom DOM, then re-renders. Most JS-capable crawlers instead drive a full headless browser, or run page scripts in-process with a fake DOM that offers no real security isolation (Node'svmis explicitly not a security boundary; cf. happy-dom CVE-2025-61927). Running hostile page JS in a true isolate against a lightweight DOM is rare.
See CHANGELOG.md for what shipped and rust/README.md for the engine internals.
Status: v0.2.7 — working (npm).
A native Rust engine (7-crate workspace on the turbo-dom crate): hardened
networking (cookies / document.cookie bridge / robots + crawl-delay / charset /
size + redirect caps, HTTP/2 + a pooled client, 304 conditional cache), crawl
orchestration (global + per-host concurrency, token-bucket politeness, backoff/retry,
canonical dedupe, depth/page caps), structured extraction, CSS+XPath query, a
no-Chromium JS render tier (a true V8 isolate over the native DOM) with re-enterable
live-heap eval_js/inject_js + a DOM-history trail, and a 60-tool MCP server
(native binary). Benchmarked against real browsers + other crawlers (below).
Install
The npm package is a thin launcher that spawns the native binary:
npm install -g turbo-surf # provides the `turbo-surf-mcp` command
# …or run without installing:
npx -y turbo-surf-mcpNode ≥ 20 to launch; the engine is a prebuilt per-platform native binary (no Node runtime hosts it).
What ships where
turbo-surf publishes from one v* git tag to two registries (see
PUBLISHING.md):
| Artifact | Registry | What it is | For |
|---|---|---|---|
turbo-surf |
npm | A thin launcher (cli.js/index.js) + prebuilt turbo-surf-mcp binaries for each platform in bin/ (darwin x64/arm64, linux x64/arm64-gnu, win x64). npx resolves the right one and spawns it. |
Running the MCP server / CLI. No Rust toolchain, no Chromium, no Node hosting Rust. |
turbo-surf |
PyPI | A PyO3 binding, shipped as prebuilt abi3 wheels (CPython 3.8+) for Linux (x86_64/aarch64), macOS (arm64), Windows (x64). | Calling the engine from Python — markdown/text/links/extract/render/…. No Rust toolchain, no Chromium. |
turbo-surf-core, -view, -page, -render, -mcp |
crates.io | The Rust crates, in dependency order. | Embedding the engine in a Rust program. |
The turbo-surf-napi cdylib and turbo-surf-transform crate are not
published — napi is for the dev harness + the Playwright shim only. The Playwright
shim (rust/playwright-shim/) is a dev/in-repo tool, not an npm artifact.
MCP server (agents)
npx turbo-surf-mcp # stdio MCP server (60 tools), e.g.:
# navigation: goto, go_back, go_forward, reload, set_user_agent
# content: interactive_elements, accessibility_tree, aria_snapshot, markdown,
# text, html, links, requests, snapshot, query, get_by,
# hydration_state, extract
# interaction: click, fill, submit, click_selector, fill_selector, select_option,
# check, uncheck, fill_many, find_text, forms, extract_links
# accessors: get_attribute, text_content, inner_html, input_value, count,
# is_visible, is_checked, is_enabled, is_editable, is_focused,
# is_empty, aria_role, accessible_name, accessible_description
# bulk: crawl, batch
# render/JS: render, set_mode, eval_js, inject_js, latest_dom, dom_history,
# evaluate, detect_js, run_playwright, probe
# stealth: stealth_status, set_fingerprint, analyze_akamai
# session: get_cookies, set_cookie, set_extra_headers, robots_check
# direct: fetch_json, fetch_rawrender/set_mode switch the Page into the JS render tier (a true V8 isolate over
the native DOM); then eval_js and inject_js run against the live render heap
(page globals, handlers) and each mutation appends to a DOM-history trail readable
via latest_dom/dom_history.
run_playwright executes a Playwright-style script directly — page /
locator / getBy* / expect (with .not) / test() blocks — over the engine,
no browser. Args: { script, url?, testIdAttribute? } (a url navigates first,
honoring set_mode so an SPA hydrates). It returns { ok, ran, logs } or
{ ok:false, error, logs } — a failed assertion is a result, not a tool error:
// tools/call run_playwright
{ "url": "https://example.com", "testIdAttribute": "data-test-id",
"script": "await expect(page.locator('h1')).toHaveText('Example Domain');\nawait page.fill('#q','rust');\nawait expect(page.locator('#q')).toHaveValue('rust');" }Set it up in Claude Code (step by step)
Never set up an MCP server before? This is the whole thing — three commands.
1. Check the prerequisites. You need Node ≥ 20
(node -v) and the Claude Code CLI
(claude --version). That's it — no Rust, no Chromium, no build step. npx
downloads a small launcher plus the prebuilt native binary for your OS the first
time you run it.
2. Register the server. Run this once (the -- separates Claude's flags from
the command Claude will spawn):
claude mcp add turbo-surf -- npx -y turbo-surf-mcpThat writes the server into your Claude Code config. npx -y turbo-surf-mcp
resolves + spawns the native binary over stdio — one process, no Node hosting it,
no browser.
3. Verify it's connected. List your servers (look for turbo-surf →
✓ connected):
claude mcp listNow start (or restart) Claude Code and ask it to, e.g., "use turbo-surf to fetch
the markdown of example.com" — the 60 tools above are available. Remove it later
with claude mcp remove turbo-surf.
Scope (where the server is registered). By default it's registered for your
user. To share it with a repo instead, add --scope project — that writes a
.mcp.json you can commit:
// .mcp.json — committed; teammates get the server automatically
{
"mcpServers": {
"turbo-surf": { "command": "npx", "args": ["-y", "turbo-surf-mcp"] }
}
}Running from a checkout (contributors) — point at a local build instead of npm:
cargo build --release -p turbo-surf-mcp --manifest-path rust/Cargo.toml
claude mcp add turbo-surf -- "$PWD/rust/target/release/turbo-surf-mcp"Troubleshooting. command not found: claude → install the Claude Code CLI.
npx hangs the first run → it's downloading the binary; let it finish (or pre-warm
with npx -y turbo-surf-mcp in a terminal, then Ctrl-C). Shows ✗ failed in
claude mcp list → run the command directly to see the error; on an unsupported
platform there's no prebuilt binary (build from a checkout as above).
Other MCP clients (Claude Desktop, Cursor, …) — point their MCP config's
command at npx with args ["-y", "turbo-surf-mcp"], or at the binary path.
Looking like a real browser (fingerprint + anti-bot)
By default every request carries a real Chrome 149 identity — full UA + client
hints (sec-ch-ua, sec-fetch-*, …) on the wire, and a matching Chrome
navigator (platform, vendor, webdriver: false, plugins, window.chrome,
native-toString) inside the JS render tier. Nothing to configure for the
common case — it just looks like Chrome.
Quick start (anti-bot)
# 1. (optional) real TLS/JA3 + HTTP-2 fingerprint — needs cmake + nasm
cargo build --release -p turbo-surf-mcp --features impersonate
# 2. (optional) enable a challenge solver — copy the template and fill ONE
cp .env.example .env
# then edit .env: TURBO_SURF_SOLVER=akamai # in-house, no key
# or: TURBO_SURF_SOLVER=cloudflare # in-house, no key
# or: TURBO_SURF_SOLVER=hyper + HYPER_API_KEY=...
# or: TURBO_SURF_SOLVER=browser + TURBO_SURF_BROWSER_CMD=...
# 3. run — the MCP server / crawler auto-detects walls and solves them
npx turbo-surf-mcpThat's it. With no .env you get the Chrome fingerprint (clears most sites);
set one TURBO_SURF_SOLVER to also clear challenge walls.
Configuration reference
All optional. Set in .env (auto-loaded) or the process env.
| Variable | Values | What it does |
|---|---|---|
TURBO_SURF_SOLVER |
cloudflare · awswaf · akamai · hyper · scrapfly · browser |
Pick the challenge solver. Unset = no solving (fingerprint only). cloudflare/awswaf/akamai are in-house (no key). |
HYPER_API_KEY |
key | Hyper Solutions token API (Akamai/DataDome/Kasada). Used when solver=hyper. |
SCRAPFLY_API_KEY |
key | Scrapfly ASP API (Cloudflare + fallback). Used when solver=scrapfly. |
TURBO_SURF_BROWSER_CMD |
shell command | The hardened-headless sidecar to run for solver=browser (e.g. node harness/browser-solver/solve.mjs). When solver=akamai, also enables the browser fallback (see below). |
TURBO_SURF_PROXY |
http://user:pass@host:port |
Egress proxy. Required for real solves — tokens bind to the IP/JA3 that minted them; replay must use the same egress. |
TURBO_SURF_SENSOR_DIR |
path (default akamai-sensors) |
Where analyze_akamai's retry saves a working sensor (keyed by script hash + version). |
Solver maturity — cloudflare (managed/Iuam) and awswaf (common + targeted
challenge.js) both run the challenge's own JS in the V8 tier (no browser) and
are the most viable in-house solves. akamai is experimental (the live sensor
crypto isn't reversed per-version yet). All in-house solvers try themselves first,
then fall back to the browser sidecar when TURBO_SURF_BROWSER_CMD is set — so a
failed self-solve still clears the wall. hyper/scrapfly/browser are the robust
paths. CAPTCHA tiers (AWS captcha, CF Turnstile-interactive) aren't self-solvable
and route straight to the sidecar.
Build feature (not env): --features impersonate swaps rustls → BoringSSL
(wreq) for a real Chrome TLS/JA3/JA4 + HTTP-2 fingerprint. Needs cmake+nasm.
MCP tools for stealth: set_fingerprint (override navigator fields),
stealth_status (inspect active profile/solver/overrides), probe (see
what a page's anti-bot JS reads), analyze_akamai (experimental: rebuild +
test Akamai sensors). Detailed below.
TLS/HTTP-2 fingerprint (impersonate). rustls can't forge Chrome's
JA3/JA4 + Akamai HTTP-2 fingerprint. Build with the opt-in feature to swap in a
BoringSSL client (wreq) that does — it presents a real Chrome 149 TLS + HTTP-2
fingerprint:
# needs a C toolchain (cmake + nasm) for BoringSSL
cargo build --release -p turbo-surf-mcp --features impersonateOff by default (keeps the pure-rustls build dependency-free). Verified against a live TLS/HTTP-2 echo in the test suite.
Fingerprint seed pool. turbo-surf-core::fingerprint holds ~4000 internally
coherent real-Chrome identities (UA client hints navigator all agree),
selected deterministically by a client key — the same host always gets the same
profile, while the fleet spreads across the pool. The MCP session and the crawl
navigator rotate per host automatically; in code:
use turbo_surf_core::{fingerprint, net::{fetch_html, FetchOptions}};
let profile = fingerprint::select("example.com"); // stable per key
let opts = FetchOptions { profile: Some(&profile), ..Default::default() };
let res = fetch_html("https://example.com/", opts).await?;JS-challenge / PoW walls (Akamai, DataDome, Kasada, Cloudflare). A fingerprint
gets you past consistency checks, not the active canvas/WebGL/PoW gates. Plug a
solver into the ChallengeSolver trait — the MCP session / crawl navigator
auto-detect a wall, solve it, and replay the cleared cookies on the fast path.
Inert until configured. Copy .env.example → .env and pick one:
- Rented (
TURBO_SURF_SOLVER=hyper|scrapfly+HYPER_API_KEY/SCRAPFLY_API_KEY). Hyper (akm.hypersolutions.co/v2/sensor,x-api-key) generates an Akamaisensor_datapayload; turbo-surf POSTs it to the target and harvests_abck(Akamai lane wired; other vendors fall back). Scrapfly (/scrape?asp=true& render_js=true) renders + returns cleared cookies (result.cookies). Both matched to their real API docs. - In-house, no key (
TURBO_SURF_SOLVER=cloudflare,awswaf, orakamai) — solve it yourself, no third party. Cloudflare runs the challenge's own JS in the V8 render tier to compute the answer (no reversing the math), then POSTs it and harvestscf_clearance. AWS WAF Bot Control (the bot layer behind CloudFront / ALB) does the same — runschallenge.jsin V8 to mint theaws-waf-token; its common tier clears on the minted token + fingerprint, thecaptchatier routes to the sidecar. Akamai is experimental (see below). - Self-owned browser sidecar (
TURBO_SURF_SOLVER=browser+TURBO_SURF_BROWSER_CMD="node harness/browser-solver/solve.mjs") — drives a real hardened headless Chromium just for the handshake, harvests the cookie, then turbo-surf does the volume. Opt-in only (spawns a process); the browser stays a dev-side sidecar, never in the engine. Seeharness/browser-solver/.
Akamai experimental flow (analyze_akamai). Akamai's live sensor_data is
per-version, obfuscated, and key-rotated, so the in-house solver isn't a guaranteed
live bypass — it's a recon → rebuild → test loop you drive from MCP:
// 1. navigate to an Akamai-walled page first (goto), then:
// MCP: analyze_akamai → finds the live Akamai script, hashes it, probes
// what it reads, builds candidate sensors per version
{ }
// MCP: analyze_akamai → with retry: POST each candidate, test live
{ "retry": true } // acceptance, and SAVE a working one locallyretry returns accepted (the version that cleared the wall, if any) and saves it
to TURBO_SURF_SENSOR_DIR. None accepted = that script's encoding still needs
reversing (the candidates are hash-seeded structural rebuilds). With
TURBO_SURF_BROWSER_CMD set, a normal Akamai goto auto-falls-back to the browser
sidecar when the in-house solve fails — so you stay unblocked while iterating.
Set TURBO_SURF_PROXY so the token's IP/JA3 matches your egress (and build with
--features impersonate so the replay JA3 matches the Chrome that minted it).
Controllable render fingerprint. Every render-tier navigator field has a
Chrome 149 default and is overridable at runtime via the MCP set_fingerprint
tool (or turbo_surf_render::set_fingerprint(json)):
// MCP: set_fingerprint
{ "overrides": {
"userAgent": "...", "platform": "Win32", "vendor": "Google Inc.",
"languages": ["en-GB", "en"], "hardwareConcurrency": 16, "deviceMemory": 8,
"chromeMajor": 150, "screen": { "width": 2560, "height": 1440 },
"devicePixelRatio": 2,
"connection": { "effectiveType": "4g", "rtt": 50, "downlink": 10 },
"userAgentData": { "platform": "Windows", "brands": [ /* … */ ] }
} } // {} resets to Chrome 149 macOS defaultsstealth_status reports the active profile, the wired solver, the pool size, and
any render fingerprint overrides.
Debug mode (probe). To see what a page's anti-bot JS reads — and what's
still missing — run the probe MCP tool (or turbo-surf-render::probe_globals):
it runs the page's scripts with navigator/screen/window.chrome/canvas
instrumented and reports every property touched plus the reads that returned
undefined (your shim to-do list). The runnable example
cargo run -p turbo-surf-render --example probe-script -- script.js does this on
any real captured challenge script.
Playwright drop-in (run your e2e specs with no browser)
turbo-surf's page API is Playwright-shaped, and rust/playwright-shim/ is a
drop-in @playwright/test replacement backed by the native engine — no
Chromium. A register.mjs module-resolution hook rewrites every
import … from "@playwright/test" (and playwright / playwright-core) to the
shim, so existing specs run unchanged on node:test:
node --import ./rust/playwright-shim/register.mjs --test 'e2e/**/*.spec.mjs'It implements Page, Locator, expect (all five assertion classes),
BrowserContext, APIRequestContext, and test + fixtures over the engine:
navigation, every getBy* locator, locator composition/filtering, DOM actions
(click/fill/check/selectOption/…), evaluate/render in the V8 tier,
cookies + real setExtraHTTPHeaders, and the full matcher set. Most of the surface
is pure JS and never crosses into Rust; only genuine DOM/render semantics do (and
an expect(locator) chain is batched into one crossing).
What a no-browser engine physically can't do fails honestly (it throws or no-ops — never a silent pass):
| Bucket | Examples | Behavior |
|---|---|---|
| Pixels / layout | screenshot, pdf, boundingBox, toHaveScreenshot, toMatchSnapshot |
throws |
| Input hardware | hover, dragTo, mouse/keyboard/touchscreen |
throws |
| Network interception | route, routeFromHAR, unroute |
throws |
| JShost bindings | exposeFunction, exposeBinding |
throws |
| Truly-async/time | waitFor*, live console/request events, timer-driven mutation |
resolve/no-op on the static DOM |
| Layout-only state | viewportSize, emulateMedia, frames |
stored / collapse to self |
Full per-method coverage map: rust/playwright-shim/LIMITATIONS.md.
Best fit: server-rendered and hydration-on-navigation apps. A client-only SPA
that paints entirely from JS after load (and never re-fetches) needs a real browser.
Python
The engine is also a PyO3 binding on PyPI — prebuilt abi3 wheels (CPython 3.8+) for Linux (x86_64/aarch64), macOS (arm64), and Windows (x64), no Rust toolchain, no Chromium. It's fetch-free: you pass a page's HTML in and get a view out (Markdown, visible text, links, a typed extraction, an a11y tree); for JS-gated pages it runs the page's own scripts in the same V8 isolate over a native DOM.
pip install turbo-surfimport turbo_surf as ts
html = open("page.html").read()
ts.markdown(html, base_url="https://example.com/") # -> Markdown str
ts.text(html) # -> visible text
ts.links(html, base_url="https://example.com/") # -> list[str]
# Typed extraction: a JSON schema maps field names to selector specs.
schema = '{"title": {"selector": "h1"}, "prices": {"selector": ".price", "list": true}}'
ts.extract(html, schema, base_url="https://example.com/") # -> JSON str
# JS-gated page: run its own scripts, read the hydrated DOM.
hydrated = ts.render(html, script, base_url="https://example.com/")Fatal faults (malformed schema JSON, a render-tier failure) raise
turbo_surf.TurboSurfError; the non-JS views never raise. Full function table:
rust/crates/turbo-surf-py/README.md.
Competitive benchmark
harness/competitive/ runs the same Playwright script on turbo-surf and a
fleet of real browsers, scoring output parity against a Chromium oracle and
timing each. npm run harness. turbo-surf drives every routine through its native
engine — turbo-dom + a deno_core V8 render tier for page JS — with no Chromium.
Median ms over 8 runs (live network), parity is each engine's observations vs the
Chromium oracle:
| engine | wikipedia | js-quotes | parity |
|---|---|---|---|
| turbo-surf (no-JS) | 142 | — (needs JS) | ✓ |
| turbo-surf (JS) | —‡ | 132 | ✓ |
| chromium (oracle) | 932 | 933 | — |
| firefox | 727 | 925 | ✓ |
| webkit | 1232 | 964 | ✓ |
Every engine produces the same observations as Chromium / Firefox / WebKit
(parity ✓) — and turbo-surf is the fastest in the table on both axes. The
Wikipedia click-through runs in 142 ms (~6.6× faster than Chromium, 932),
and the real jQuery on quotes.toscrape.com/js — the same 10 quotes Chromium
extracts — in 132 ms (~7× faster than Chromium, 933), inside a true V8
isolate over a native rtdom DOM, no Chromium process. It stays network-bound
via a pooled HTTP client (connection/TLS reuse across pages), a persistent V8
isolate whose DOM install is reused across same-page page.evaluates (parse
once per page, ~0.5 ms/call after), an external-script cache (jQuery fetched
once, not per page), a per-page parse cache, and a back-forward snapshot cache
so goBack restores instead of re-fetching. Profilers: cargo bench -p turbo-surf-view (Rust microbench) + harness/hotpath/rust-hotpath.mjs.
‡ JS mode runs the page's own scripts; on a server-rendered page like Wikipedia
that over-runs (use no-JS there — 142 ms, 4/4). The form routine is omitted this
run (httpbin.org was returning 503/timeouts for every engine).
The harness auto-detects installed engines (firefox/webkit, and anti-detect
browsers like playwright-extra/patchright/rebrowser-playwright); see
harness/competitive/README.md. (Numbers are
network-bound and machine/run dependent.)
Crawler-vs-crawler benchmark
harness/crawlers/ races turbo-surf against other open-source crawlers on a
real, paginated site — same 20-page same-host crawl of books.toscrape.com,
same ~150 ms per-request politeness on every engine, items counted with the
same CSS selector. Median of 3 timed runs, live network (npm run crawl-bench):
| crawler | runtime model | items | median ms | pages/s |
|---|---|---|---|---|
crawlee CheerioCrawler |
Node | 339 | 2767 | 7.2 |
| turbo-surf (no-js) | native Rust, browserless | 339 | 3271 | 6.1 |
| spider-rs | Rust + N-API | 194 | 3486 | 5.7 |
| got + cheerio (hand-rolled) | Node | 339 | 5590 | 3.6 |
node-crawler (crawler) |
Node | 339 | 49624 | 0.4 |
| Scrapy | Python (subprocess) | 246 | 49270 | 0.4 |
| Colly | Go (subprocess) | 320 | 45664 | 0.4 |
Head-to-head vs CheerioCrawler (the closest competitor) — the same 20-page
crawl, maxConcurrency = 2, median of 5 warm runs
(node harness/crawlers/head-to-head.mjs). With throttling off (raw engine
speed — the truest apples-to-apples, since the two throttle models differ)
turbo-surf is clearly ahead:
| engine | politeness | median ms | pages/s |
|---|---|---|---|
| turbo-surf (no-js) | none (raw) | 1977 | 10.1 |
| crawlee CheerioCrawler | none (raw) | 2748 | 7.3 |
| crawlee CheerioCrawler | 150 ms rate | 2634 | 7.6 |
| turbo-surf (no-js) | 150 ms (strict) | 3307 | 6.0 |
Raw, turbo-surf is ~1.4× faster. Under the "150 ms politeness" rows it looks
slower only because turbo-surf's per-host gate is a strict interval (it really
waits 150 ms between requests), whereas crawlee's maxRequestsPerMinute is a
lenient sliding window that lets a short 20-page burst through near-raw. Same
content (339 items), no browser.
At equal politeness the multi-engine wall-clock is network-bound, so the in-process crawlers cluster together: turbo-surf sits in the top tier alongside the dedicated speed engines (crawlee, spider-rs) and ahead of a hand-rolled got+cheerio loop — while extracting equivalent content and running no browser. turbo-surf runs the whole BFS — fetch, parse, same-host gate, per-page item count — in its native Rust engine, and is ~13× ahead of Scrapy / Colly. The heavyweight engines trail ~15×: Scrapy and Colly pay a fresh process startup per crawl (the harness shells out to their CLIs), and node-crawler's per-request overhead is high. turbo-dom's raw parse advantage doesn't show here — at 20 live pages, network swamps a sub-millisecond parse; it shows in-memory instead (links ~18k/s, crawl ~14k pages/s, below). And unlike every engine in this table, the same turbo-surf also runs Playwright scripts (parity table above).
JS-executing crawlers — turbo-surf vs real browsers
The other set targets quotes.toscrape.com/js, where quotes are built
client-side (a non-JS crawler sees ~0). turbo-surf runs the page's own scripts in
a true V8 isolate over its native DOM, while every competitor drives a real
headless Chromium (npm run crawl-bench:js, 11 pages, median of 3):
| crawler | JS engine | items | median ms | pages/s |
|---|---|---|---|---|
| turbo-surf (JS) | V8 isolate, no browser | 110 | 3031 | 3.63 |
crawlee PuppeteerCrawler |
headless Chromium | 110 | 4330 | 2.54 |
crawlee PlaywrightCrawler |
headless Chromium | 110 | 5569 | 1.98 |
| puppeteer-cluster | headless Chromium | 110 | 20024 | 0.55 |
turbo-surf runs each page's own scripts in a true V8 isolate over the native
rtdom DOM (the same path that renders quotes.toscrape.com/js, external scripts
cached across pages) — the fastest engine here, extracting the same 110 quotes
a real browser does ~1.4–6.6× faster than the browser-driving crawlers, with no
Chromium process and honoring the 150 ms politeness delay. Running page JS in a
true isolate against a lightweight DOM is rare — most crawlers either drive a full
browser (this table) or use an in-process fake DOM with no real isolation. See
harness/crawlers/README.md — competitors
auto-detect and missing ones are skipped; install them with
npm i -D @spider-rs/spider-rs crawlee cheerio got crawler (+ brew install pipx go && pipx install scrapy for Scrapy/Colly). (Machine/run dependent.)
Development
The engine is Rust (rust/ workspace); the only JS is the launcher (cli.js/
index.js) + the dev harness.
cd rust
cargo test --workspace # the offline suite
cargo clippy --workspace --all-targets
cargo fmt
cargo build --release -p turbo-surf-mcp # the MCP binary the launcher spawns
cargo bench -p turbo-surf-view # Rust microbench (parse / views)From the repo root: npm run check lints/formats the launcher JS, runs the Rust
gate, and runs the Playwright shim suites (npm run test:playwright =
build:addon → test:shim offline unit tests → test:e2e drop-in specs through
register.mjs, no Chromium). npm run harness / crawl-bench / hotpath run the
benchmarks (install playwright + crawlee ad-hoc for the competitor engines).
License
MIT