crawlex
The stealth crawler that actually looks like Chrome.
TLS, HTTP/2, JS fingerprint — every byte indistinguishable from real Chrome 149.
Rust core • Node SDK • Lua hooks • cross-platform binaries.
pnpm add -g crawlex && crawlex pages run --seed https://example.com --method render
Quickstart · Features · Examples · Docs · Why crawlex
Why crawlex
Standard crawlers fail on the first Cloudflare wall. crawlex arrives the way real Chrome arrives — every fingerprint surface is identical, not approximated.
| Layer | What we match — exactly, not approximately |
|---|---|
| TLS ClientHello | Extension order, ALPS, GREASE values, permute_extensions, X25519MLKEM768, signature algorithms — verified against tls.peet.ws and ja4db.com oracles |
| HTTP/2 frame | Pseudo-header order :method :authority :scheme :path, SETTINGS frame parameters, WINDOW_UPDATE pattern — passes Akamai BMP signature checks |
| JS fingerprint | 29-section stealth shim: navigator, chrome.*, permissions, plugins, screen, timezone, battery, WebGL (vendor / params / extensions), canvas (zero-preserving noise), AudioContext (FFT + offline render), Function.prototype.toString proxy, WebGPU, performance.memory, sensors, iframe, requestAnimationFrame throttle, performance.now() 100µs grain, mediaDevices, fonts, WebRTC SDP/ICE/getStats scrub |
| Behavior | Mouse jitter, scroll cadence, dwell time, idle drift — coherent motion:: profiles per persona |
| Catalog | 30 Chrome stable + 30 Chromium + 20 Firefox fingerprints, plus Edge and Safari. Era-fallback resolution: ask for chrome-149-linux, get the closest captured profile |
| Worker scope | Same shim auto-attached to dedicated / shared / service workers via CDP Target.setAutoAttach |
→ Validated against BrowserScan, CreepJS, Sannysoft, tls.peet.ws, ja4db.com.
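The era-fallback resolution named in the Catalog row could work roughly like the sketch below. This is an illustration under stated assumptions, not crawlex's actual resolver: the `parseProfile` and `resolveProfile` helpers, the `browser-version-os` name parsing, and the "numerically closest version" rule are all hypothetical.

```typescript
// Hypothetical sketch of era-fallback resolution; not the crawlex implementation.
interface ProfileId {
  browser: string;
  version: number;
  os: string;
}

// Parse names shaped like "chrome-149-linux" (assumed naming scheme).
function parseProfile(name: string): ProfileId | null {
  const m = /^([a-z]+)-(\d+)-([a-z]+)$/.exec(name);
  if (!m) return null;
  return { browser: m[1], version: Number(m[2]), os: m[3] };
}

// Return the exact profile if it was captured; otherwise return the captured
// profile with the same browser and OS whose version is numerically closest.
function resolveProfile(
  requested: string,
  captured: string[],
): { name: string; exact: boolean } | null {
  const want = parseProfile(requested);
  if (!want) return null;
  let best: { name: string; distance: number } | null = null;
  for (const name of captured) {
    const have = parseProfile(name);
    if (!have || have.browser !== want.browser || have.os !== want.os) continue;
    const distance = Math.abs(have.version - want.version);
    if (best === null || distance < best.distance) best = { name, distance };
  }
  return best ? { name: best.name, exact: best.distance === 0 } : null;
}
```

A caller would typically log a warning whenever `exact` is `false`, which mirrors the "falls back to closest era + warns" behavior described for the CLI.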
Install
# npm — bundled binary download via postinstall
pnpm add -g crawlex
# Rust — from source
cargo install crawlex
# Direct binary (linux x86_64/arm64, macOS x86_64/arm64, windows x86_64)
# https://github.com/forattini-dev/crawlex/releases/latest
Production crawls run locally, never in CI. Datacenter IPs (GitHub Actions, AWS, Azure) are flagged instantly by every modern WAF.
Last 24h highlights
- 1.0.6 release line focuses on canonical URL identity, lossless frontier admission and cache/queue alias deduplication.
- JS/TS hooks now run through the SDK bridge, so defineHooks() can drive the same lifecycle decisions as embedded Rust hooks.
- NDJSON events now carry richer artifacts, Web Vitals, per-fetch timings, crawl-attempt telemetry and crawl-resolution summaries.
- The supported release artifact is the full crawlex binary with RedDB persistence enabled.
- Large crawl efficiency grew: cache validation, prefetch discovery mode and best-first URL scoring are now available from CLI/config.
- Render fallback grew: external CDP connection, GPU posture control, Shadow DOM flattening, overlay cleanup and last-resort fallback fetch are configurable.
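To make "canonical URL identity" and "alias deduplication" concrete, here is a minimal sketch of how a frontier might collapse URL aliases into one key. The rules shown (drop the fragment, sort query parameters, rely on the WHATWG URL parser for scheme, host and default-port normalization) are assumptions chosen for illustration; crawlex's real canonicalization rules are not documented here, and `canonicalKey` and `admit` are hypothetical names.

```typescript
// Hypothetical canonicalization rules for illustration only.
function canonicalKey(raw: string): string {
  // The WHATWG URL parser lowercases the scheme and host and drops default ports.
  const u = new URL(raw);
  // Fragments are never sent to the server, so they cannot identify a resource.
  u.hash = '';
  // Assumption: parameter order does not change the response.
  u.searchParams.sort();
  return u.toString();
}

// Admit a URL into the frontier only if its canonical key is new.
function admit(seen: Set<string>, raw: string): boolean {
  const key = canonicalKey(raw);
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}
```

Sorting query parameters is the riskiest rule in this sketch: some servers treat parameter order as significant, so a real implementation would need to keep the original URL for fetching and use the key only for deduplication.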
Quickstart
# Stealth render with persona, sitemap discovery, NDJSON event stream
crawlex pages run \
--seed https://target.com \
--method render \
--persona atlas \
--max-depth 3 \
--screenshot \
--emit ndjson > events.ndjson
# Live tail what just happened
jq -c 'select(.event == "fetch.completed" or .event == "render.completed")' events.ndjson
Three integration paths, your pick:
| CLI | Node SDK | Embedded Rust |
|---|---|---|
| One-shot crawls, scripted pipelines. | Production services with hook logic. | In-process embedding, zero IPC. |
Examples
1. Hunt a SaaS product page with vitals + screenshot
import { crawl } from 'crawlex';
for await (const ev of crawl({
seeds: ['https://stripe.com/pricing'],
args: {
method: 'render',
persona: 'atlas', // macOS Apple M1, Retina, en-US
screenshot: true,
screenshotMode: 'fullpage',
storage: 'filesystem',
storagePath: './out',
waitStrategy: '{"NetworkIdle":{"idle_ms":1500}}',
},
})) {
if (!('event' in ev)) continue;
switch (ev.event) {
case 'render.completed':
console.log(`✅ ${ev.url} | LCP=${ev.data.vitals.largest_contentful_paint_ms}ms | CLS=${ev.data.vitals.cumulative_layout_shift}`);
break;
case 'artifact.saved':
if (ev.data.kind === 'screenshot.full_page')
console.log(`📸 → out/${ev.data.path} (${(ev.data.size/1024).toFixed(0)}kB)`);
break;
case 'challenge.detected':
console.log(`🚧 ${ev.data.vendor} (${ev.data.level}) on ${ev.url}`);
break;
}
}
2. Crawl an entire domain with proxy rotation + retry policy
import { crawl, defineHooks } from 'crawlex';
const hooks = defineHooks({
// Rate-limit retry: 429/503 → re-enqueue (up to retry_max)
async onAfterFirstByte(ctx) {
if (ctx.response_status === 429 || ctx.response_status === 503) return 'retry';
return 'continue';
},
// Inject the canonical sitemap.xml for every host we touch
async onDiscovery(ctx) {
const host = new URL(ctx.url).host;
return {
decision: 'continue',
patch: { capturedUrls: [...ctx.captured_urls, `https://${host}/sitemap.xml`] },
};
},
// Tag the crawl with custom metadata that lands in user_data
async onJobStart(ctx) {
return {
decision: 'continue',
patch: { userData: { ...ctx.user_data, run_owner: 'qa-bot' } },
};
},
});
for await (const ev of crawl({
seeds: ['https://target.com'],
args: {
method: 'auto', // policy engine picks http vs render
maxConcurrentHttp: 8,
maxConcurrentRender: 2,
maxDepth: 5,
crtsh: true, // certificate-transparency seeding
storage: 'reddb',
storagePath: 'file://./state/crawl.rdb',
queue: 'reddb',
queuePath: 'file://./state/frontier.rdb',
proxies: ['http://user:pass@proxy1:8080', 'http://user:pass@proxy2:8080'],
proxyStrategy: 'health-weighted',
proxyStickyPerHost: true,
},
hooks,
signal: AbortSignal.timeout(30 * 60_000),
})) {
if (!('event' in ev)) continue;
if (ev.event === 'job.failed') console.error(`✗ ${ev.url} — ${ev.data.error}`);
if (ev.event === 'run.completed') console.log('done.');
}
3. Embedded library with custom Rust hooks
use crawlex::{Config, Crawler, queue::FetchMethod};
use crawlex::hooks::{HookDecision, HookRegistry};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
#[tokio::main]
async fn main() -> crawlex::Result<()> {
let hooks = HookRegistry::new();
let pages_seen = Arc::new(AtomicUsize::new(0));
// Closure-captured counter — observe without intervening
let counter = pages_seen.clone();
hooks.on_response_body(move |_ctx| {
let c = counter.clone();
Box::pin(async move {
c.fetch_add(1, Ordering::Relaxed);
Ok(HookDecision::Continue)
})
});
// Domain-level deny list — short-circuit before fetch
hooks.on_before_each_request(|ctx| {
let url = ctx.url.clone();
Box::pin(async move {
if url.path().starts_with("/admin/") { return Ok(HookDecision::Skip); }
Ok(HookDecision::Continue)
})
});
let config = Config::builder()
.max_concurrent_http(16)
.build()?;
let crawler = Crawler::new(config)?.with_hooks(hooks);
crawler.seed_with(
vec!["https://target.com".parse().unwrap()],
FetchMethod::HttpSpoof,
).await?;
crawler.run().await?;
println!("Crawled {} pages", pages_seen.load(Ordering::Relaxed));
Ok(())
}
→ Full runnable example: examples/embedded_with_hooks.rs
4. Pin a specific browser fingerprint from the catalog
# Browse 80+ ready-to-use fingerprints
crawlex stealth catalog list
crawlex stealth catalog list --filter chrome
crawlex stealth catalog show chrome-149-linux
# Pin a precise version + OS
crawlex pages run --seed https://target.com \
--profile chrome-149-linux
# Era fallback: chromium-122 not captured? falls back to closest era + warns
crawlex pages run --seed https://target.com \
--profile chromium-122-linux
# Mobile persona (touch viewport, sec-ch-ua-mobile: ?1)
crawlex pages run --seed https://target.com \
--method render --persona pixel
5. Inspect what your stealth stack actually emits
# Print active IdentityBundle + TLS profile summary
crawlex stealth inspect --profile chrome-149-linux
# Verify ALPN/cipher/JA4 against built-in expectations
crawlex stealth test
# Compare against tls.peet.ws / ja4db.com via the live oracle
crawlex stealth catalog show chrome-149-linux --json
6. Large crawl: validate cache, prefetch links, score the frontier
crawlex pages run \
--seed https://docs.example.com \
--method auto \
--queue reddb --queue-path file://./state/frontier.rdb \
--storage reddb --storage-path file://./state/crawl.rdb \
--cache-validate \
--cache-max-age-secs 86400 \
--prefetch \
--best-first \
--score-keyword docs \
--score-keyword api \
--emit ndjson
This mode is for discovery passes: reuse fresh cache rows, harvest links cheaply, and let higher-value URLs rise in the queue before expensive render passes.
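The idea behind `--best-first` with `--score-keyword` can be sketched as a keyword-weighted sort of the frontier. The weights, the depth penalty and the `Candidate`, `scoreUrl` and `bestFirst` names below are all hypothetical; the real scoring function may differ.

```typescript
// Hypothetical scoring: the weights and the depth penalty are assumptions.
interface Candidate {
  url: string;
  depth: number;
}

// Each keyword found in the URL adds 10 points; each level of depth costs 1.
function scoreUrl(c: Candidate, keywords: string[]): number {
  const haystack = c.url.toLowerCase();
  let score = 0;
  for (const kw of keywords) {
    if (haystack.includes(kw.toLowerCase())) score += 10;
  }
  return score - c.depth;
}

// Return a copy of the frontier ordered from highest to lowest score.
function bestFirst(frontier: Candidate[], keywords: string[]): Candidate[] {
  return [...frontier].sort(
    (a, b) => scoreUrl(b, keywords) - scoreUrl(a, keywords),
  );
}
```

A production frontier would more likely keep a priority queue than re-sort an array, but the ordering it produces follows the same principle.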
Features
Stealth core
Discovery
Antibot policy engine
Pipeline
Observability
Integrations
NDJSON event stream
Every run emits one JSON envelope per line on stdout. Versioned, stable, 21 kinds:
{"v":1,"event":"run.started","ts":"2026-04-26T19:42:00.000Z","run_id":42,"data":{"policy_profile":"strict","max_concurrent_http":8,"max_concurrent_render":2}}
{"v":1,"event":"job.started","run_id":42,"url":"https://target.com/","data":{"job_id":"j_001","method":"render","depth":0,"priority":0,"attempts":0}}
{"v":1,"event":"fetch.completed","run_id":42,"url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"bytes":98234,"body_truncated":false,"dns_ms":12,"tcp_connect_ms":18,"tls_handshake_ms":24,"ttfb_ms":142,"download_ms":83,"total_ms":280,"alpn":"h2","tls_version":"TLSv1.3","cipher":"TLS_AES_128_GCM_SHA256"}}
{"v":1,"event":"crawl.attempted","run_id":42,"url":"https://target.com/","data":{"crawl_id":42,"attempt_index":1,"engine":"http_spoof","status":403,"blocked":true,"block_reason":"Cloudflare challenge form"}}
{"v":1,"event":"render.completed","run_id":42,"session_id":"sess_abc","url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"manifest":true,"service_workers":1,"is_spa":true,"vitals":{"ttfb_ms":142,"first_contentful_paint_ms":380.5,"largest_contentful_paint_ms":920.1,"cumulative_layout_shift":0.03,"total_blocking_time_ms":50.0,"dom_nodes":1842,"js_heap_used_bytes":12345678,"resource_count":45,"total_transfer_bytes":982341}}}
{"v":1,"event":"artifact.saved","run_id":42,"url":"https://target.com/","data":{"kind":"screenshot.full_page","mime":"image/png","size":1234567,"sha256":"a1b2c3...","path":"artifacts/sess_abc/1714123456_screenshot_full_page_a1b2c3d4.png"}}
{"v":1,"event":"challenge.detected","run_id":42,"url":"https://protected.com/","data":{"vendor":"cloudflare_turnstile","level":"widget_present"}}
{"v":1,"event":"decision.made","run_id":42,"url":"https://protected.com/","why":"render:js-challenge","data":{"decision":"retry","reason":{"code":"render:js-challenge"}}}
{"v":1,"event":"crawl.resolved","run_id":42,"url":"https://target.com/","data":{"crawl_id":42,"attempts_count":2,"fallback_fetch_used":false,"resolved_by":"render","success":true}}
{"v":1,"event":"run.completed","run_id":42}
Discriminator key: event (snake_case) — TypeScript narrows via switch (ev.event) { … }. Fallback for malformed lines: { kind: 'raw', line } so consumers can log/recover.
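For consumers that read the stream without the SDK, the envelope and the raw-line fallback can be handled along these lines. This is a sketch: the `Envelope` fields are taken from the sample lines above, while `parseLine` and `countEvents` are hypothetical helper names, not SDK exports.

```typescript
// Envelope fields taken from the sample lines; the payload is loosely typed.
interface Envelope {
  v: number;
  event: string;
  run_id?: number;
  url?: string;
  data?: Record<string, unknown>;
}

// Fallback shape for lines that are not valid envelopes.
interface RawLine {
  kind: 'raw';
  line: string;
}

// Parse one NDJSON line; anything that is not an object with a string
// `event` field is returned as a raw line so callers can log or recover.
function parseLine(line: string): Envelope | RawLine {
  try {
    const obj: unknown = JSON.parse(line);
    if (
      typeof obj === 'object' &&
      obj !== null &&
      typeof (obj as { event?: unknown }).event === 'string'
    ) {
      return obj as Envelope;
    }
  } catch {
    // Malformed JSON falls through to the raw fallback below.
  }
  return { kind: 'raw', line };
}

// Tally events by kind, counting malformed lines under "raw".
function countEvents(ndjson: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of ndjson.split('\n')) {
    if (line.trim() === '') continue;
    const ev = parseLine(line);
    const key = 'event' in ev ? ev.event : 'raw';
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Checking `'event' in ev` is the same narrowing step the examples use before switching on `ev.event`.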
Hooks — 12 lifecycle points × 3 languages
before_each_request → after_dns → after_tls → after_first_byte → on_response_body
→ after_load → after_idle → on_discovery → on_job_start → on_job_end
→ on_error → on_robots_decision
| Language | API | Best for |
|---|---|---|
| Rust | hooks.on_after_first_byte(closure), with full &mut HookContext access | Embedded library, latency-critical paths |
| JS / TS | defineHooks({...}) via SDK, over an IPC bridge with async closures | Production crawls, business logic |
| Lua | --hook-script foo.lua, with page-driving helpers (page_click, page_eval) | Ad-hoc scripts, no build step |
All three modes return the same decision: continue / skip / retry / abort. Hooks can mutate ctx.captured_urls, inject extra URLs, write to user_data to communicate with downstream hooks, or override robots_allowed.
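Because every hook resolves to one of the four decisions, the decision logic can be written and tested as a pure function before it is wired into `defineHooks()`. The sketch below assumes a context that exposes `response_status` (seen in the examples above) plus an `attempts` counter, which is a hypothetical field here, and an illustrative `retryMax` cap.

```typescript
// The four decisions listed above.
type Decision = 'continue' | 'skip' | 'retry' | 'abort';

// Hypothetical context shape: `attempts` is assumed for this sketch.
interface FirstByteCtx {
  response_status: number;
  attempts: number;
}

// Retry rate-limited responses until the cap is reached, then skip the URL.
function afterFirstByte(ctx: FirstByteCtx, retryMax = 3): Decision {
  const rateLimited =
    ctx.response_status === 429 || ctx.response_status === 503;
  if (!rateLimited) return 'continue';
  return ctx.attempts < retryMax ? 'retry' : 'skip';
}
```

Keeping the policy separate from the hook registration makes it easy to reuse the same rule from a JS/TS hook and to unit-test it without running a crawl.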
Personas — coherent identity bundles
Each persona is a complete bundle — UA + Sec-CH-UA + screen + viewport + DPR + GPU + fonts + media-device counts + TLS profile + motion timings — so every signal matches. No mismatched UA + WebGL combo gives you away.
| Codename | OS | GPU | Locale | Form factor |
|---|---|---|---|---|
| tux | Linux | Intel UHD 630 | en-US | desktop 1920×1080 |
| office | Windows 10 | Intel UHD 620 | en-US | laptop 1920×1080 (DPR 1.25) |
| gamer | Windows 10 | NVIDIA GTX 1060 | pt-BR | desktop 1920×1080 |
| atlas | macOS | Apple M1 | en-US | retina 1440×900 (DPR 2.0) |
| pixel | Android 14 | Adreno 640 | pt-BR | mobile 412×823 (DPR 2.625) |
crawlex pages run --seed https://target.com --persona atlas # macOS
crawlex pages run --seed https://target.com --persona pixel # mobile
Architecture
flowchart LR
S[Seeds] --> Q[Frontier<br/>+ dedupe + rate-limit]
Q --> P[Policy Engine]
P --> C[Cache Validator<br/>ETag + Last-Modified + head fingerprint]
C -->|fresh| ST[Storage<br/>5 traits]
C -->|stale| F[ImpersonateClient<br/>BoringSSL + h2 patched]
P -->|http| F
P -->|render| R[RenderPool<br/>Chromium + stealth shim]
F --> X[Extractor<br/>+ Asset Refs]
R --> X
X --> D[Discovery<br/>Pipeline]
X --> ST
D --> Q
P --> EV[NDJSON Events<br/>21 kinds]
R --> H1[Rust Hooks]
R --> H2[JS Bridge]
R --> H3[Lua Scripts]
Module map:
- impersonate/: TLS catalog + BoringSSL connector + ALPS + GREASE
- render/: Chromium pool + 29-section stealth shim + motion engine + ScriptSpec runner
- discovery/: 17-stage pipeline (DNS, RDAP, sitemap, robots, crtsh, wayback, well-known, …)
- policy/: pure engine (decide_pre_fetch, decide_post_fetch, decide_post_error, decide_post_challenge)
- antibot/: vendor classifier + 4 captcha solver adapters
- cache_validator/: cache freshness by HTTP validators and head fingerprints
- storage/: 5 concern-oriented traits (artifact / state / challenge / telemetry / intel)
- events/: NDJSON envelope + sink (stdout / null / memory)
- hooks/: registry + JS bridge + Lua host
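The cache validator's use of HTTP validators can be illustrated with standard conditional requests. The sketch below is an assumption-laden illustration, not the module's code: the `CacheRow` shape and the function names are hypothetical, and head fingerprints are left out. Only the `If-None-Match` / `If-Modified-Since` headers and the 304 status come from the HTTP specification.

```typescript
// Hypothetical cache row; crawlex's stored shape may differ.
interface CacheRow {
  etag?: string;
  lastModified?: string;
  fetchedAtMs: number;
}

// A row younger than the max age is reused without touching the network.
function isFresh(row: CacheRow, nowMs: number, maxAgeSecs: number): boolean {
  return nowMs - row.fetchedAtMs <= maxAgeSecs * 1000;
}

// Build conditional request headers from whichever validators were stored.
function conditionalHeaders(row: CacheRow): Record<string, string> {
  const headers: Record<string, string> = {};
  if (row.etag) headers['If-None-Match'] = row.etag;
  if (row.lastModified) headers['If-Modified-Since'] = row.lastModified;
  return headers;
}

// 304 Not Modified means the cached body is still valid.
function interpretStatus(status: number): 'reuse' | 'refetch' {
  return status === 304 ? 'reuse' : 'refetch';
}
```

With a max age of 86400 seconds, as in the large-crawl example, rows fetched within the last day would be reused directly and older rows would be revalidated with these headers.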
Tech stack
| Layer | Implementation |
|---|---|
| TLS | boring-sys — BoringSSL fork with ALPS / permute_extensions / X25519MLKEM768 |
| HTTP/2 | Vendored h2 crate with pseudo-header order patch (vendor/h2) |
| CDP | chromiumoxide-derived, embedded behind cdp-backend feature |
| Async | tokio multi-thread |
| Storage / Queue | RedDB embedded/client storage plus RedDB Queue frontier; DashMap memory only for isolated tests |
| Discovery | hickory-resolver (DNS), reqwest (RDAP), texting_robots (robots.txt) |
| Lua | mlua 0.10 (optional, lua-hooks feature) |
| SDK | Node 20+, CommonJS, zero runtime deps |
Release binary:
- crawlex: full build with HTTP impersonation + Chromium rendering + stealth shim + RedDB persistence
Versus the alternatives
| | crawlex | Playwright stealth | Puppeteer + plugins | curl-impersonate |
|---|---|---|---|---|
| TLS-perfect ClientHello | BoringSSL | relies on Chromium | relies on Chromium | |
| H2 pseudo-header order | patched h2 | Chromium default | Chromium default | |
| 29-section JS leak coverage | partial | via plugins | no JS | |
| Worker-scope stealth | auto-attach | manual | manual | |
| HTTP-only path (no browser) | full binary policy path | |||
| RedDB telemetry + resume state | native | external | external | |
| Discovery pipeline | 17 stages | |||
| Streaming NDJSON events | versioned | |||
| Rust embedding | libcurl | |||
| Single binary |
Documentation
- forattini-dev.github.io/crawlex — full docsify hub
- Architecture overview
- CLI reference
- Config JSON schema
- NDJSON event envelope
- Guides — HTTP-only, rendered sessions, persistent runs
- Stealth & proxies
Contributing
git clone https://github.com/forattini-dev/crawlex
cd crawlex
# Unit tests + offline shim compliance
cargo test --lib # 820+ tests
cargo test --test fpjs_compliance # 27 cases
cargo test --test tls_catalog_coverage --test tls_catalog_roundtrip
# SDK tests
pnpm test # 21 node:test cases
# Quality gates
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo publish --dry-run --locked
# Live integration tests (require system Chromium)
cargo test --all-features --test stealth_runtime_live -- --ignored
cargo test --all-features --test worker_shim_live -- --ignored
CI runs all of the above on every PR. Contributions are welcome: issues, feature requests, and PRs are all reviewed.
License
Dual-licensed under MIT OR Apache-2.0 at your option. SPDX: MIT OR Apache-2.0.
Third-party attribution: see NOTICE.
Built for crawlers who refuse to be detected.
Docs · Releases · Issues · Discussions