1.4.0 • Published 22h ago

crawl4ai-cloud

Licence

Apache-2.0

Version

1.4.0

Deps

Size

659 kB

Vulns

Weekly

281

Stars

Summary Dependency Versions

Crawl4AI Cloud SDK for Node.js

One-line web scraping, screenshots, and structured extraction — powered by Crawl4AI Cloud.

v0.7.0

What's new in 0.7.0

scrape + scrapeMany are the canonical methods (replacing markdown / markdownMany). Same shape, same response — they hit the new /v1/scrape(/async) endpoints.
markdown / markdownMany are kept as deprecated aliases that route to /v1/scrape. Removed in 0.8.0.
extractMany signature fixed: now takes extractMany(url, { extraUrls: […], … }) (was extractMany(urls[], …), which the server rejected). method: 'auto' is now allowed for batch.
sources field added to scan and map ('primary' | 'extended'). Replaces mode, which is kept as a deprecated alias.
crawlSite and deepCrawl target deprecated endpoints. They still work and emit a DeprecationWarning. Migrate to the composable scan + scrape/extract chain — see below.

Install

npm install crawl4ai-cloud

Quick Start

import { AsyncWebCrawler } from 'crawl4ai-cloud';

const c = new AsyncWebCrawler({ apiKey: 'sk_live_...' });

// Scrape a page → clean markdown (+ optional links/media/metadata/tables)
const page = await c.scrape('https://example.com');
console.log(page.markdown);

// Take a full-page screenshot
const shot = await c.screenshot('https://example.com');
// shot.screenshot is base64 PNG

// Extract structured data — AUTO picks css_schema vs llm.
// wait: true auto-hydrates job.results from per-URL S3.
const job = await c.extractMany('https://news.ycombinator.com', {
  method: 'auto',
  query: 'get each story title, URL, and points',
  wait: true,
});
for (const r of job.results ?? []) console.log(r.extractedContent);

// Discover all URLs on a domain
const sitemap = await c.map('https://docs.python.org', { sources: 'primary' });
sitemap.urls.slice(0, 10).forEach(u => console.log(u.url));

// Fast web search → ranked results in ~half a second
const results = await c.search('best espresso machine under 500', { country: 'us' });
results.results.forEach(hit => console.log(hit.rank, hit.title, hit.url));

Search the web

crawler.search() is a fast, synchronous web search — ranked results in ~0.5s (Google + Bing, merged and re-ranked). No jobs, no polling.

const resp = await c.search('best espresso machine under 500', {
  country: 'us',     // ISO-2 region bias (default 'us')
  language: 'en',    // ISO-639-1 (default 'en')
});
console.log(resp.ms, 'ms,', resp.results.length, 'results');
for (const hit of resp.results) {
  console.log(hit.rank, hit.title, hit.url, hit.domainTier);
}

Query enhancement — for messy, conversational asks with buried constraints, set enhance: true and an LLM rewrites the query first (adds ~300ms; the rewrite comes back on the response):

const resp = await c.search(
  'what laptop should i buy for video editing under 1500 but not a mac',
  { enhance: true },
);
console.log(resp.enhancedQuery);  // → 'best video editing laptop under $1500 Windows -mac'
console.log(resp.enhanceMs);      // enhancement time, separate from resp.ms

Option	Default	What it does
`country`	`'us'`	ISO-2 country code — biases results by region
`language`	`'en'`	ISO-639-1 language code
`enhance`	`false`	LLM rewrites the query first (best for vague/conversational queries)
`cache`	`true`	Reuse a recent identical search; `false` always fetches fresh
`site`	—	Restrict to one site (prepends `site:<site>`)

High-relevance re-ranking is always on. For multi-backend SERPs with AI overviews and synthesized answers, use crawler.discovery('search', ...) instead.

Crawl a whole site — composable two-step

There's no single bundled "crawl this site" method anymore. Compose it:

// Step 1 — discover URLs
const scan = await c.scan('https://books.toscrape.com', {
  criteria: 'all book detail pages',
  maxUrls: 20,
});
const urls = scan.urls.map(u => u.url);

// Step 2A — pipe to scrapeMany for markdown
const mdJob = await c.scrapeMany(urls, { strategy: 'http', wait: true });

// Step 2B — OR pipe to extractMany for structured fields. The base url is
// the schema TEMPLATE — schema is generated once, then re-applied across
// extraUrls for free in css_schema mode (10-100× cheaper than per-page LLM).
const [base,...rest] = urls;
const exJob = await c.extractMany(base, {
  extraUrls: rest,
  method: 'auto',
  query: 'book title, price, rating',
  wait: true,
});
for (const r of exJob.results ?? []) console.log(r.extractedContent);

Multi-URL fan-out — what happens under the hood

scrapeMany(urls), screenshotMany(urls), and extractMany(url, { extraUrls }) all decompose into N independent child jobs that distribute across the worker pool. Throughput scales with pool size, not with how many URLs you submitted.

const job = await c.scrapeMany(urls, { wait: false });             // returns immediately
// Poll to inspect per-URL state without downloading the data:
const status = await c.getScrapeJob(job.jobId);                    // WrapperJob
for (const u of status.urlStatuses ?? []) {
  console.log(u.index, u.url, u.status, u.durationMs, u.error);
}

// Fetch one URL's full result on demand (recipe-agnostic — works for any wrapper)
const result = await c.getPerUrlResult(job.jobId, 0);              // CrawlResult
console.log(result.markdown);

wait: true does this for you: poll → wait until terminal → call getPerUrlResult for each URL in parallel → populate job.results. Failed URLs become CrawlResult stubs (success=false + errorMessage) so job.results.length === job.urlStatuses.length.

Wrapper API Reference

Method	What it does	Endpoint
`scrape(url)`	Page → clean markdown + optional `include: ['links'\|'media'\|'metadata'\|'tables']`	`POST /v1/scrape`
`scrapeMany(urls)`	Async batch scrape (≤100 URLs)	`POST /v1/scrape/async`
`screenshot(url)`	Base64 PNG (and optional PDF)	`POST /v1/screenshot`
`screenshotMany(urls)`	Async batch screenshot	`POST /v1/screenshot/async`
`extract(url, { query })`	Sync extract — single URL	`POST /v1/extract`
`extractMany(url, { extraUrls })`	Async extract — base + followers, schema reused for free in css_schema mode	`POST /v1/extract/async`
`scan(url, { sources, criteria })`	URL discovery — sources=primary/extended, optional AI criteria	`POST /v1/scan`
`map(url, { sources })`	Simpler URL discovery (no criteria)	`POST /v1/map`
`enrich({ query })`	Multi-entity table from a brief, list of entities, or list of URLs	`POST /v1/enrich/async`

Deprecated (still work, emit warnings)

Method	Migration
`markdown(url)` / `markdownMany(urls)`	Use `scrape` / `scrapeMany`.
`crawlSite(url, { criteria, extract })`	Use `scan` + `extractMany`. See "Crawl a whole site" above.
`deepCrawl(url, …)`	Use `scan(url, { scan: { mode: 'deep' } })` + the chain above.
`scan(url, { mode })` / `map(url, { mode })`	Use `sources: 'primary'` or `sources: 'extended'`.

Enrich v2 (build a data table)

Multi-phase enrichment. Give a brief, a list of entities, or a list of URLs. The job walks through planning → resolving_urls → extracting → merging, with optional human-review pauses at plan_ready and urls_ready.

// 1. Agent one-shot — give a brief, get a table back
const result = await crawler.enrich({
  query: 'licensed nurseries in North York Toronto with extended hours',
  country: 'ca',
  topKPerEntity: 3,
});
for (const row of result.phaseData.rows ?? []) {
  console.log(row.inputKey, row.fields);
  for (const [f, c] of Object.entries(row.certainty)) {
    console.log(`  ${f}: ${c.toFixed(2)}  (from ${row.sources[f].url})`);
  }
}
console.log(`crawls=${result.usage.crawls}, llm=${JSON.stringify(result.usage.llmTotals)}`);

// 2. Pre-resolved URLs — skip planning + URL resolution
const r2 = await crawler.enrich({
  urls: ['https://example.com/a', 'https://example.com/b'],
  features: ['price', 'hours'],   // string shortcut for [{name: 'price'},...]
});

// 3. Human review flow — pause for editing the plan
const job = await crawler.enrich({
  query: 'best Italian restaurants in Brooklyn',
  country: 'us',
  autoConfirmPlan: false,
  autoConfirmUrls: false,
  wait: false,
});

let paused = await crawler.waitEnrichJob(job.jobId, { until: 'plan_ready' });
console.log(paused.phaseData.plan?.entities, paused.phaseData.plan?.features);

await crawler.resumeEnrichJob(job.jobId, {
  entities: [{ name: "Lucali" }, { name: "Roberta's" }],
  features: [{ name: 'address' }, { name: 'hours' }],
});
paused = await crawler.waitEnrichJob(job.jobId, { until: 'urls_ready' });
await crawler.resumeEnrichJob(job.jobId);   // accept server's URL picks
const final = await crawler.waitEnrichJob(job.jobId);

// 4. Live progress via SSE
for await (const event of crawler.streamEnrichJob(job.jobId)) {
  if (event.type === 'phase')   console.log('→', event.status);
  if (event.type === 'row')     console.log('✓', event.row?.inputKey);
  if (event.type === 'complete') break;
}

// Job management
const jobs = await crawler.listEnrichJobs({ limit: 5 });
await crawler.cancelEnrichJob(job.jobId);

Vocabulary: entities are row identifiers; criteria are search-side filters; features are extraction columns. Per-purpose usage is reported in result.usage.llmTokensByPurpose with five buckets: plan_intent, url_plan, paywall_classify, extract, merge_tiebreak (only buckets that ran appear).

Wrapper API Reference

Method	Endpoint	Returns	Description
`markdown(url)`	`POST /v1/markdown`	`MarkdownResponse`	Clean markdown from any page
`screenshot(url)`	`POST /v1/screenshot`	`ScreenshotResponse`	Full-page screenshot (PNG) or PDF
`extract(url)`	`POST /v1/extract`	`ExtractResponse`	LLM or schema-based structured extraction
`map(url)`	`POST /v1/map`	`MapResponse`	Simple URL discovery (always sync)
`scan(url, {criteria})`	`POST /v1/scan`	`ScanResult`	AI-assisted URL discovery with plain-English criteria
`crawlSite(url, {criteria, extract})`	`POST /v1/crawl/site`	`SiteCrawlResponse`	AI-assisted whole-site crawl (always async)
`enrich(opts)`	`POST /v1/enrich/async`	`EnrichJobStatus`	Multi-phase enrichment from query / entities / URLs

Markdown Options

const res = await c.markdown('https://example.com', {
  strategy: 'browser',   // 'browser' (default) or 'http'
  fit: true,              // return cleaned/fitted markdown (default true)
  include: ['article'],   // CSS selectors to include
  proxy: { mode: 'on', country: 'US' },  // proxy off by default; 'on' = residential
  bypassCache: true,
});
// res.markdown, res.fitMarkdown, res.links, res.media, res.metadata

Screenshot Options

const res = await c.screenshot('https://example.com', {
  fullPage: true,         // full page vs viewport (default true)
  pdf: true,              // also generate PDF
  waitFor: 'css:.loaded', // wait for element before capture
});
// res.screenshot (base64 PNG), res.pdf (base64 PDF)

Extract Options

// Auto mode -- LLM picks the best approach
const res = await c.extract('https://example.com/products', {
  query: 'Extract product name, price, and rating',
  method: 'auto', // 'auto' (default), 'llm', or 'schema'
});

// Schema mode -- CSS/XPath selectors, no LLM cost
const res = await c.extract('https://example.com/products', {
  method: 'schema',
  schema: {
    name: 'Products',
    baseSelector: '.product-card',
    fields: [
      { name: 'title', selector: 'h2', type: 'text' },
      { name: 'price', selector: '.price', type: 'text' },
    ],
  },
});

Map Options (legacy, sync-only)

const res = await c.map('https://example.com', {
  mode: 'deep',            // 'default' or 'deep' (DomainMapper source depth)
  maxUrls: 500,
  includeSubdomains: true,
  query: 'blog posts',     // relevance scoring
  scoreThreshold: 0.5,
});

Scan Options (AI-assisted)

// AI picks everything from criteria
const result = await c.scan('https://docs.crawl4ai.com', {
  criteria: 'API reference pages',
  maxUrls: 50,
});

// Explicit overrides on top of LLM output
const result = await c.scan('https://example.com', {
  criteria: 'product pages',
  scan: {
    mode: 'auto',          // 'auto' | 'map' | 'deep'
    patterns: ['*/p/*'],
    includeSubdomains: false,
  },
});

// Async deep scan (waits for completion)
const result = await c.scan('https://directory.example.com', {
  criteria: 'company profile pages',
  scan: { mode: 'deep', maxDepth: 3 },
  wait: true,              // poll until done
  pollInterval: 3,         // seconds
});

Site Crawl Options (AI-assisted flagship)

const job = await c.crawlSite('https://books.toscrape.com', {
  // AI-assisted fields (new in v0.4.0)
  criteria: 'all book listing pages',
  scan: { mode: 'auto' },                        // optional explicit override
  extract: {
    query: 'book title, price, rating',
    jsonExample: { title: '...', price: '£0.00', rating: 0 },
    method: 'auto',                              // 'auto' | 'llm' | 'schema'
  },
  include: ['markdown', 'links'],                // drop 'markdown' to strip it
  // Standard fields
  maxPages: 50,
  strategy: 'http',                              // 'browser' or 'http'
  fit: true,
  // Legacy fields (still supported, prefer the above)
  discovery: 'bfs',                              // 'map' (default) / 'bfs' / 'dfs' / 'best_first'
  pattern: '/docs/*',
  maxDepth: 3,
  // Waiting
  wait: true,                                    // poll until done
  pollInterval: 5,                               // seconds
  timeout: 300,                                  // seconds
});

// job.generatedConfig — LLM's scan + extract decisions
// job.extractionMethodUsed — "css_schema" | "llm"
// job.schemaUsed — generated CSS schema (reusable!)

Drop markdown with include: pass include: ['links', 'media'] (without 'markdown') and the worker strips markdown from every result — saves bandwidth for extract-only crawls.

Async / Batch Operations

For multiple URLs, use the batch variants. They return a WrapperJob you can poll or wait on.

markdownMany

const job = await c.markdownMany(
  ['https://a.com', 'https://b.com', 'https://c.com'],
  { wait: true, timeout: 120 },
);
console.log(`Job ${job.jobId}: ${job.status}`);

screenshotMany

const job = await c.screenshotMany(
  ['https://a.com', 'https://b.com'],
  { fullPage: true, wait: true },
);

extractMany

const job = await c.extractMany(
  ['https://a.com/products', 'https://b.com/products'],
  {
    method: 'llm', // 'auto' not supported for batch -- use 'llm' or 'schema'
    query: 'Extract product name and price',
    wait: true,
  },
);

Batch Options (shared)

All batch methods accept these additional options:

Option	Type	Default	Description
`wait`	`boolean`	`false`	Block until job completes
`pollInterval`	`number`	`2`	Seconds between status checks (when `wait: true`)
`timeout`	`number`	--	Max seconds to wait
`webhookUrl`	`string`	--	URL to POST results to on completion
`priority`	`number`	`5`	Job priority (1=high, 5=normal, 10=low)

Job Management

Wrapper Jobs (markdown, screenshot, extract)

// Get job status
const job = await c.getMarkdownJob(jobId);
const job = await c.getScreenshotJob(jobId);
const job = await c.getExtractJob(jobId);

// List jobs
const jobs = await c.listMarkdownJobs({ status: 'completed', limit: 10 });
const jobs = await c.listScreenshotJobs({ limit: 5 });
const jobs = await c.listExtractJobs;

// Cancel
await c.cancelMarkdownJob(jobId);
await c.cancelScreenshotJob(jobId);
await c.cancelExtractJob(jobId);

Crawl Jobs (from runMany / deepCrawl)

const jobs = await c.listJobs({ status: 'completed', limit: 10 });
const job = await c.getJob(jobId);
const job = await c.waitJob(jobId, { pollInterval: 2, timeout: 300 });
await c.cancelJob(jobId);

// Download results as ZIP
const url = await c.downloadUrl(jobId);

Scan Jobs (AI-assisted deep scans)

// Poll status + URLs discovered so far
const job = await c.getScanJob(jobId);
console.log(`Status: ${job.status}, found: ${job.totalUrls}`);

// Cancel a running deep scan (preserves partial results)
await c.cancelScanJob(jobId);

Site Crawl Jobs (unified scan + crawl polling)

const status = await c.getSiteCrawlJob(jobId);
console.log(`${status.phase}: ${status.progress.urlsCrawled}/${status.progress.total}`);
// status.phase walks: "scan" -> "crawl" -> "done"
// status.downloadUrl is set when phase === "done" and status === "completed"

Power User: Config Passthrough

Every wrapper method accepts optional crawlerConfig and browserConfig for full control over the underlying crawl engine.

const res = await c.markdown('https://spa-app.com', {
  crawlerConfig: {
    waitFor: 'css:.content-loaded',
    jsCode: 'window.scrollTo(0, document.body.scrollHeight)',
    scanFullPage: true,
    pageTimeout: 60000,
  },
  browserConfig: {
    viewportWidth: 1920,
    viewportHeight: 1080,
    headers: { 'Accept-Language': 'en-US' },
  },
  proxy: { mode: 'on', country: 'US' },  // proxy is off by default
});

This works on markdown, screenshot, extract, and their batch variants. The map method accepts proxy only (no browser needed).

Proxy modes

Proxy is off by default - requests go out directly. Opt in with the proxy block:

Mode	What it does	Credits
`'off'`	No proxy - direct egress. Default.	1×
`'on'`	Standard residential proxy. Add `country` (ISO-2) to geo-target.	5×
`'crawl4ai'`	Crawl4AI proxy (Web-Unlocker) for hard / anti-bot sites..	10×

There's no auto-escalation - a request uses exactly the mode you ask for. Older values still work ('none' -> off; 'residential' / 'datacenter' / 'auto' -> on).

Full Power Mode

For advanced use cases -- custom extraction strategies, full HTML access, media/link parsing, session management -- use the core run / runMany / deepCrawl methods directly.

Single Crawl

const result = await c.run('https://example.com', {
  config: { screenshot: true, excludeExternalLinks: true },
  browserConfig: { viewportWidth: 1920 },
  proxy: { mode: 'on', country: 'US' },  // off by default; 'on' = residential
});
console.log(result.markdown?.rawMarkdown);
console.log(result.screenshot); // base64
console.log(result.links);
console.log(result.media);

Batch Crawl

const job = await c.runMany(
  ['https://a.com', 'https://b.com'],
  { wait: true, priority: 1 },
);

Deep Crawl

const result = await c.deepCrawl('https://docs.example.com', {
  strategy: 'bfs',
  maxDepth: 3,
  maxUrls: 200,
  wait: true,
});

Scan (URL Discovery Only)

const scan = await c.scan('https://example.com', { maxUrls: 100 });
console.log(`Found ${scan.totalUrls} URLs across ${scan.hostsFound} hosts`);

Error Handling

import {
  CloudError,
  AuthenticationError,
  RateLimitError,
  QuotaExceededError,
  NotFoundError,
  ValidationError,
  TimeoutError,
} from 'crawl4ai-cloud';

try {
  const res = await c.markdown('https://example.com');
} catch (err) {
  if (err instanceof AuthenticationError) {
    console.log('Bad API key');
  } else if (err instanceof RateLimitError) {
    console.log(`Rate limited, retry after ${err.retryAfter}s`);
  } else if (err instanceof QuotaExceededError) {
    console.log('Credits exhausted');
  } else if (err instanceof TimeoutError) {
    console.log('Request timed out');
  } else if (err instanceof CloudError) {
    console.log(`API error ${err.statusCode}: ${err.message}`);
  }
}

Environment Variables

export CRAWL4AI_API_KEY=sk_live_...

// API key auto-loaded from CRAWL4AI_API_KEY
const c = new AsyncWebCrawler;

License

Apache 2.0

Keywords

crawl4ai web-scraping crawler cloud api web-crawler scraping markdown