@nikx/dory-worker

Version 1.1.8 · Licence ISC · Size 226 kB · Weekly downloads 2.9K

dory-worker

BullMQ job consumer for the Dory web scraping platform. Runs on any machine with Docker — including a Raspberry Pi. Pulls scraping jobs off a shared Redis queue and executes them by launching dory-core containers locally.

npm: @nikx/dory-worker@1.1.8

Architecture

dory-api (Railway)
  └─ enqueues job → BullMQ (Redis)
       └─ dory-worker (your home machine / Pi)
            │  GET /api/runs/:id/config
            │  POST /api/runs/:id/status  (running / completed / failed)
            │
            ├─ Single-container mode  (containerCount = 1)
            │    └─ docker run dory-core:v2
            │         └─ Crawlee in-memory queue
            │
            └─ Distributed mode  (containerCount > 1)
                 ├─ docker run dory-core:v2 × N
                 │    ├─ REDIS_URL=redis://host.docker.internal:6379
                 │    ├─ QUEUE_NAME=<runId>   ← job-scoped, isolated
                 │    ├─ WORKER_ID=worker-1..N
                 │    └─ IDLE_TIMEOUT_SECS=60
                 │
                 └─ Shared Redis queue  (rq:<queueId>:*)
                      ├─ :meta      queue metadata
                      ├─ :requests  all URLs ever added (Hash)
                      ├─ :ordering  Lua-locked sorted set
                      └─ :handled   completed requestIds (Set)
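
The key layout in the diagram can be written down as a small helper. This is only an illustrative sketch: the name `queueKeys` and the `QueueKeys` type are hypothetical and not part of dory-worker; only the `rq:<queueId>:` prefix and the four suffixes come from the diagram.

```typescript
// Hypothetical helper (not from the dory-worker source): builds the four
// Redis key names listed under "Shared Redis queue" in the diagram.
interface QueueKeys {
  meta: string;     // queue metadata
  requests: string; // Hash of all URLs ever added
  ordering: string; // sorted set used for locking
  handled: string;  // Set of completed requestIds
}

function queueKeys(queueId: string): QueueKeys {
  const prefix = `rq:${queueId}`;
  return {
    meta: `${prefix}:meta`,
    requests: `${prefix}:requests`,
    ordering: `${prefix}:ordering`,
    handled: `${prefix}:handled`,
  };
}
```

Because distributed containers receive QUEUE_NAME=<runId>, each run would presumably get its own set of four keys, which is what keeps concurrent runs isolated.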

Multi-Worker Coordination (Helper-Join Model)

When multiple workers are running, they automatically coordinate to share work on distributed jobs:

Worker-1 picks up job from BullMQ
  └─ Becomes coordinator → spawns container(s)
  └─ POST /api/runs/:id/status { status: "running", coordinatorWorkerId: "worker-1" }

Worker-2, Worker-3 (heartbeat loop every 10s)
  └─ GET /api/runs/active/distributed → discovers active run
  └─ POST /api/runs/:id/join → becomes helper participant
  └─ Spawns own container(s) sharing the same Redis crawl queue
  └─ On exit → POST /api/runs/:id/leave

Key behaviors:

Workers discover active distributed runs via heartbeat polling (every 10 seconds)
Each worker can participate in multiple concurrent runs (limited by MAX_CONCURRENT_RUNS)
Helpers exit independently when the shared queue drains (idle timeout)
Coordinator waits for all helpers to leave before reporting final status
Schedule deduplication: if a scheduled job triggers while a previous run is still queued/running, the new trigger is skipped
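
The helper side of this model can be sketched as a pure decision function that runs on every heartbeat tick. This is a hedged illustration, not the worker's actual code: `ActiveRun`, `selectRunsToJoin`, and the response shape are assumptions, since the payload of GET /api/runs/active/distributed is not documented here.

```typescript
// Hypothetical shape of one entry returned by GET /api/runs/active/distributed.
interface ActiveRun {
  id: string;
}

// Decide which discovered runs this worker should join on a heartbeat tick.
// A worker never joins a run twice and never exceeds MAX_CONCURRENT_RUNS.
function selectRunsToJoin(
  active: ActiveRun[],
  alreadyJoined: Set<string>,
  maxConcurrentRuns: number,
): string[] {
  const freeSlots = Math.max(0, maxConcurrentRuns - alreadyJoined.size);
  return active
    .map((run) => run.id)
    .filter((id) => !alreadyJoined.has(id))
    .slice(0, freeSlots);
}
```

For each returned id, the worker would then call POST /api/runs/:id/join and spawn its own containers, and call POST /api/runs/:id/leave once they exit.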

Distributed queue internals

Concern	Mechanism
Deduplication	`SHA-256(uniqueKey).slice(0,15)` → requestId; `HGET :requests` guard before any write
Atomic locking	Lua script `LUA_LIST_AND_LOCK` — `ZADD` score = ±lockExpiresAt; no two containers claim the same URL
Retry	Crawlee increments `retryCount`, re-enqueues until `maxRequestRetries`; exhausted → `SADD :handled` + `errorMessages`
Idle shutdown	`IDLE_TIMEOUT_SECS` = `min(15, actorTimeoutSecs / 2)`; containers exit cleanly when the queue drains
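
The deduplication row can be illustrated with a short sketch. The table only says that the requestId is the first 15 characters of a SHA-256 over uniqueKey; the base64 encoding and the stripping of `+`, `/`, and `=` below are assumptions, and the in-memory Map merely stands in for the Redis `:requests` hash. The names `requestIdFor` and `addIfNew` are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Derive a 15-character requestId from a uniqueKey.
// Assumption: base64 digest with non-alphanumeric characters removed.
function requestIdFor(uniqueKey: string): string {
  return createHash("sha256")
    .update(uniqueKey)
    .digest("base64")
    .replace(/[+/=]/g, "")
    .slice(0, 15);
}

// In-memory stand-in for the `HGET :requests` guard: a URL is written only
// if its requestId has never been seen before.
function addIfNew(requests: Map<string, string>, url: string): boolean {
  const id = requestIdFor(url);
  if (requests.has(id)) return false; // duplicate, skip the write
  requests.set(id, url);
  return true;
}
```

Adding the same seed URL three times with `addIfNew` stores it once, which is the behavior the triplicate-seed deduplication scenario exercises.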

Prerequisites

Node.js ≥ 20.6
Docker (with access to dory-core:v2 image — build locally or pull from registry)
Redis (local container or remote — same instance used by dory-api)

Quick Start

Install (recommended)

Install Node.js ≥ 20.6, then install the worker globally, log in, and run the interactive setup. Setup installs Docker, writes .env, pre-pulls the image, runs health checks, and starts the worker as a systemd/launchd service:

npm install -g @nikx/dory-worker
dory-worker login        # authenticate to the API (required before setup)
dory-worker setup

dory-worker login asks for your dory-api URL and your Dory username/password, then mints and stores a per-worker API key. The setup, configure, start, and start-bg commands refuse to run until a key is available. See Authentication below.

dory-worker is a full management CLI — run it with no subcommand to start the worker in the foreground, or use any of:

dory-worker login        # authenticate to the API and store a worker key
dory-worker logout       # remove the stored worker key
dory-worker whoami       # show the current login
dory-worker setup        # full one-command install + configure + start service
dory-worker configure    # (re)write the .env interactively
dory-worker start        # run in the foreground (Ctrl+C to stop)
dory-worker start-bg     # install + start the background service
dory-worker status       # service status
dory-worker logs         # tail live logs
dory-worker health       # check Docker / API / Redis connectivity
dory-worker update       # update to the latest published version
dory-worker restart      # restart after a config change
dory-worker stop         # stop the background service
dory-worker uninstall    # remove the background service

Authentication

The API requires every worker to authenticate with a per-worker API key (dwk_…). The worker sends it as an X-API-Key header on all API calls, and passes it to the scraping containers it spawns so their result uploads are authenticated too. Without a key, the worker can't fetch config, heartbeat, or join runs — setup/configure/start/start-bg refuse to run until you have one.

Getting a key — two ways

1. dory-worker login (typical). You're on the worker machine and have a Dory login:

dory-worker login
# Dory API URL [http://localhost:4500]: https://your-api.example.com
# Username: alice
# Password: ********
# Worker label (optional): prod-eu-1

This verifies your credentials, mints a key, and saves it (with the API URL and label) to ~/.dory/credentials (mode 0600). Then run dory-worker setup.

Non-interactive (CI/scripts):

dory-worker login --api https://your-api.example.com -u alice -p "$DORY_PASSWORD" --label ci-runner
# or set DORY_USERNAME / DORY_PASSWORD in the environment

2. Admin-generated key. An admin generates a key in the UI (Worker Keys → Generate key) and gives you the dwk_… string. You don't need a Dory login — set it as an env var instead of running login:

export WORKER_API_KEY=dwk_...
dory-worker setup

WORKER_API_KEY takes precedence over ~/.dory/credentials, which is handy for containers and unattended deploys.
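
The precedence rule can be sketched as follows. This is a minimal illustration under stated assumptions: the on-disk format of ~/.dory/credentials is not documented here, so the `StoredCredentials` shape and its `apiKey` field are hypothetical, as are the function names.

```typescript
// Hypothetical shape of the parsed ~/.dory/credentials file.
interface StoredCredentials {
  apiKey?: string;
}

// WORKER_API_KEY from the environment wins over the stored credentials.
// Returns null when no key is available, in which case the worker cannot start.
function resolveWorkerKey(
  env: Record<string, string | undefined>,
  stored: StoredCredentials | null,
): string | null {
  const fromEnv = env.WORKER_API_KEY?.trim();
  if (fromEnv) return fromEnv;
  return stored?.apiKey ?? null;
}

// Every API call carries the key in the X-API-Key header.
function authHeaders(key: string): Record<string, string> {
  return { "X-API-Key": key };
}
```

In a real process you would pass `process.env` and the parsed credentials file; keeping both as parameters makes the precedence easy to test in isolation.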

Managing keys

dory-worker whoami — show the current login (user, API, label)
dory-worker logout — remove the stored key from ~/.dory/credentials
Admins can list and revoke keys in the UI (Worker Keys page). Revoking a key cuts the worker off immediately; it must run dory-worker login again (or get a new key) to reconnect.

Manual setup (advanced)

If you'd rather wire things up yourself instead of using dory-worker setup:

1. Install globally

npm install -g @nikx/dory-worker

This gives you the dory-worker CLI command.

2. Pull the scraper image

docker pull bynikx/dory-core:v2

3. Create a `.env` file

Create a .env file in the directory where you'll run the worker:

mkdir -p ~/dory-worker && cd ~/dory-worker

cat > .env << 'EOF'
# Only BOOTSTRAP keys go here. Operational config (Docker image, concurrency,
# container count, crawler Redis, log level) is served by the API and applied
# at startup / hot-reloaded — manage it in the dashboard, not here. Results
# upload through the API, so workers need no GCS bucket or GCP credentials.

# ─── Required ──────────────────────────────────────────────────────────
# Public URL of your dory-api instance. MUST match the API URL you used with
# `dory-worker login` — the worker key is tied to that API, so a mismatch means
# authenticating to one API while talking to another. (The guided `setup`/
# `configure` flow sets this automatically from your login; you only set it by
# hand on this manual path.)
API_BASE_URL=https://your-api.railway.app

# ─── Redis (same instance used by dory-api) ───────────────────────────
# Option A: full URL (recommended)
REDIS_URL=redis://default:password@your-redis-host:6379

# Option B: host + port (uncomment these instead of REDIS_URL)
# REDIS_HOST=your-redis-host
# REDIS_PORT=6379
# REDIS_PASSWORD=your-password

# BullMQ queue name — must match the API's QUEUE_RUN_EXECUTION
QUEUE_RUN_EXECUTION=run-execution

# Optional: friendly name for this worker (defaults to dory-worker-{pid}).
# Also used to target this node with a per-worker config override in the API.
# WORKER_ID=pi-worker-01

# ─── Auth ──────────────────────────────────────────────────────────────
# Per-worker API key. Normally you DON'T set this here — `dory-worker login`
# stores it in ~/.dory/credentials. Set it explicitly only for unattended
# deploys (containers/CI) or when an admin generated a key for you in the UI.
# It takes precedence over ~/.dory/credentials.
# WORKER_API_KEY=dwk_...

# ─── Optional: pin this node to local config (ignore API config) ──────
# Uncomment CONFIG_SOURCE and add operational keys to override the API here.
# CONFIG_SOURCE=local
# DOCKER_IMAGE=bynikx/dory-core:v2
# MAX_CONCURRENT_RUNS=2
# CRAWLER_REDIS_URL=redis://host.docker.internal:6379
EOF

Tip: Replace API_BASE_URL and REDIS_URL with your actual values. The Redis instance must be the same one your dory-api connects to.
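
The two Redis options in the .env file could be resolved roughly like this. It is a sketch, not the worker's actual loadConfig(): the names `RedisTarget` and `resolveRedis` are hypothetical, and giving REDIS_URL priority when both options are set is an assumption. The localhost and 6379 defaults are the ones listed under Environment Variables.

```typescript
// Hypothetical result type: either a full URL or discrete connection fields.
interface RedisTarget {
  url?: string;
  host?: string;
  port?: number;
  password?: string;
}

// Option A (REDIS_URL) is used when present; otherwise fall back to
// Option B (REDIS_HOST / REDIS_PORT / REDIS_PASSWORD) with documented defaults.
function resolveRedis(env: Record<string, string | undefined>): RedisTarget {
  if (env.REDIS_URL) {
    return { url: env.REDIS_URL };
  }
  return {
    host: env.REDIS_HOST ?? "localhost",
    port: Number(env.REDIS_PORT ?? 6379),
    password: env.REDIS_PASSWORD,
  };
}
```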

4. Authenticate

Get a worker key before starting (see Authentication):

dory-worker login        # or set WORKER_API_KEY in .env from an admin-issued key

5. Start the worker

cd ~/dory-worker
dory-worker

The worker will connect to Redis, start polling for jobs, and send heartbeats to the API every 10 seconds. You should see it appear on the Workers page in the Dory UI.

6. Run on boot (optional)

To keep the worker running after reboots, create a systemd service (Linux / Raspberry Pi):

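# The redirect must run as root, so pipe the heredoc through `sudo tee`
# (with `sudo cat > file`, the unprivileged shell opens the file and fails).
sudo tee /etc/systemd/system/dory-worker.service > /dev/null << EOF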
[Unit]
Description=Dory Worker
After=network.target docker.service
Requires=docker.service

[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/dory-worker
ExecStart=$(which dory-worker)
EnvironmentFile=$HOME/dory-worker/.env
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now dory-worker

Check status: sudo systemctl status dory-worker

On macOS, use a Launch Agent or just run dory-worker in a terminal / tmux session.

7. Multiple workers on the same machine (optional)

You can run multiple worker processes on the same device. Unless you set WORKER_ID, each gets a default ID derived from its PID (dory-worker-{pid}), and BullMQ ensures no two workers pick up the same job.

# Terminal 1
WORKER_ID=worker-a dory-worker

# Terminal 2
WORKER_ID=worker-b MAX_CONCURRENT_RUNS=1 dory-worker

Or give each worker its own directory with a separate .env file:

cd ~/dory-worker
dory-worker                          # uses .env (default)

cd ~/dory-worker-2
dory-worker                          # uses its own .env

All instances share the same Docker daemon and Redis queue — useful for splitting MAX_CONCURRENT_RUNS across processes or isolating different queue configurations.

Environment Variables

Variable	Required	Default	Description
`API_BASE_URL`	yes	—	dory-api URL (reachable from Docker containers)
`REDIS_URL`	one of	—	Full Redis URL
`REDIS_HOST`	one of	`localhost`	Redis hostname
`REDIS_PORT`	—	`6379`	Redis port
`REDIS_PASSWORD`	—	—	Redis password
`CRAWLER_REDIS_URL`	distributed	—	Redis for the per-run crawler queue
`CONTAINER_COUNT`	—	`1`	Containers per job (overridden by dory-api per actor)
`MAX_CONCURRENT_RUNS`	—	`2`	Parallel BullMQ jobs
`WORKER_ID`	—	`dory-worker-{pid}`	Label shown in logs
`LOG_LEVEL`	—	`info`	`debug | info | warn | error`
`DOCKER_IMAGE`	—	—	Fallback image if API doesn't return one
`QUEUE_RUN_EXECUTION`	—	`run-execution`	BullMQ queue name for scraping jobs — must match the API's value

GCS: workers and containers no longer use GCS. Results upload through dory-api, which is the only component holding the bucket and credentials — set GCS_BUCKET / GCP_PROJECT_ID / credentials on the API, not here.

How a Job Flows

1. dory-api enqueues a BullMQ job { runId } onto the run-execution queue (configurable via QUEUE_RUN_EXECUTION).
2. Worker picks up the job and calls GET /api/runs/:id/config to get actorConfig, dockerImage, containerCount, memoryLimitMb, and actorTimeoutSecs.
3. Worker calls POST /api/runs/:id/status → { status: "running" }.
4. Worker calls docker run (once for single-container, N times for distributed). Each container receives:
   - ACTOR_CONFIG: base64-encoded actor/user-input JSON
   - API_BASE_URL: so the container can POST status callbacks
   - CRAWLEE_MEMORY_MBYTES: from memoryLimitMb
   - (distributed only) REDIS_URL, QUEUE_NAME, WORKER_ID, IDLE_TIMEOUT_SECS
5. Worker extends the BullMQ lock every 2 minutes while containers run.
6. Worker calls docker wait on all containers (in parallel) and uses the worst exit code.
7. Worker calls POST /api/runs/:id/status → { status: "completed"|"failed", exitCode }, but only if no HTTP callback arrived (fallback).
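
The docker run and docker wait steps above can be sketched as two small functions. This is an illustration under assumptions, not the code in docker.ts: the `RunConfig` and `DistributedSettings` shapes and both function names are hypothetical, and treating the numerically largest exit code as the "worst" is a guess at what the README means.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical subset of the /config response used to build container env.
interface RunConfig {
  actorConfig: unknown;
  memoryLimitMb: number;
}

// Hypothetical shape for the extra variables passed in distributed mode.
interface DistributedSettings {
  redisUrl: string;
  queueName: string; // the runId, so each job gets an isolated queue
  workerId: string;
  idleTimeoutSecs: number;
}

// Build the environment variables injected into one dory-core container.
function buildContainerEnv(
  apiBaseUrl: string,
  cfg: RunConfig,
  dist?: DistributedSettings,
): Record<string, string> {
  const env: Record<string, string> = {
    ACTOR_CONFIG: Buffer.from(JSON.stringify(cfg.actorConfig)).toString("base64"),
    API_BASE_URL: apiBaseUrl,
    CRAWLEE_MEMORY_MBYTES: String(cfg.memoryLimitMb),
  };
  if (dist) {
    env.REDIS_URL = dist.redisUrl;
    env.QUEUE_NAME = dist.queueName;
    env.WORKER_ID = dist.workerId;
    env.IDLE_TIMEOUT_SECS = String(dist.idleTimeoutSecs);
  }
  return env;
}

// Assumption: the "worst" exit code is the largest one, so any non-zero
// code from any container marks the whole run as failed.
function worstExitCode(codes: number[]): number {
  return codes.reduce((worst, code) => Math.max(worst, code), 0);
}
```

Each entry of the returned record would map to one `-e NAME=value` argument on the docker run command line.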

containerCount precedence: dory-api /config response > CONTAINER_COUNT env var > default 1.
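
A minimal sketch of that precedence chain, assuming invalid or missing values simply fall through to the next source (the validation rules and the name `resolveContainerCount` are assumptions, not taken from the worker source):

```typescript
// Precedence: API /config value, then CONTAINER_COUNT env var, then 1.
// Assumption: only positive integers count as valid at each level.
function resolveContainerCount(
  apiValue: number | undefined,
  envValue: string | undefined,
): number {
  if (apiValue !== undefined && Number.isInteger(apiValue) && apiValue >= 1) {
    return apiValue;
  }
  const parsed = Number.parseInt(envValue ?? "", 10);
  if (Number.isInteger(parsed) && parsed >= 1) {
    return parsed;
  }
  return 1;
}
```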

localhost rewriting: API_BASE_URL and CRAWLER_REDIS_URL containing localhost are automatically rewritten to host.docker.internal before being injected into containers.
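
The rewrite can be approximated with a single anchored regular expression. This is a sketch of the idea rather than the worker's implementation: the name `rewriteLocalhost` is hypothetical, and it only rewrites a host that is exactly localhost, leaving hosts such as localhost.example.com alone.

```typescript
// Matches "<scheme>://[userinfo@]localhost" when localhost is the whole host,
// i.e. it is followed by a port, path, query, fragment, or the end of the URL.
const LOCALHOST_AUTHORITY =
  /^([a-z][a-z0-9+.-]*:\/\/(?:[^@\/]*@)?)localhost(?=[:\/?#]|$)/i;

// Replace a localhost host with host.docker.internal so that the URL still
// points at the host machine when used from inside a container.
function rewriteLocalhost(url: string): string {
  return url.replace(LOCALHOST_AUTHORITY, "$1host.docker.internal");
}
```

Inside a container, localhost refers to the container itself, which is why a URL that works on the host has to be rewritten before it is injected.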

Source Layout

src/
  cli.ts             Entry point — loads config, starts BullMQ Worker
  config.ts          WorkerConfig interface + loadConfig() from env vars
  worker.ts          BullMQ Worker setup, concurrency, graceful shutdown
  processor.ts       Core job handler — fetch config, spawn containers, wait
  docker.ts          docker run / docker wait wrappers; DistributedOpts
  logger.ts          Structured logger with log levels

test/
  harness.ts         Standalone test harness — mock API + real worker + Redis inspection
  redis-inspector.ts Post-run queue inspector — reads rq:* keys, returns metrics

scripts/
  run-all-tests.ts   13-scenario E2E suite runner → writes E2E-TEST-REPORT.md

test-image/
  Dockerfile         Minimal test image used by the harness in CI

Testing

Run a single scenario

# Minimal (single container, empty handlers — validates worker lifecycle)
npm test

# Real cheerio crawl (quotes.toscrape.com, 10 pages)
SCENARIO=real-crawl npm test

# Distributed mode (2 containers)
npm run test:distributed

# Distributed, 3 containers, 50 pages
DISTRIBUTED=true CONTAINER_COUNT=3 SCENARIO=dist-large npm test

# Deduplication — triplicate seed URLs
DISTRIBUTED=true SCENARIO=dedup npm test

# Retry on failure — handler throws on page 2
DISTRIBUTED=true SCENARIO=retry-failure npm test

# API failure resilience
SCENARIO=api-error EXPECT_FAILURE=true npm test

Valid SCENARIO values: minimal, real-crawl, large-crawl, distributed, dist-large, dedup, retry-failure, api-error, missing-redis.

Run the full 13-scenario suite

npm run test:all

Results are written to E2E-TEST-REPORT.md.

E2E test results (v1.0.3)

#	Category	Scenario	Result	Duration
T01	happy-path	Single-container · minimal	pass	6.1s
T02	happy-path	Single-container · 10-page crawl	pass	6.8s
T03	happy-path	Single-container · 50-page crawl	pass	7.3s
T04	distribution	Distributed · 2 containers · 10 pages	pass · 0% skew	67.3s
T05	distribution	Distributed · 3 containers · 10 pages	pass · 10% skew	68.1s
T06	distribution	Distributed · 2 containers · 50 pages	pass · 0% skew	9.1s
T07	distribution	Distributed · 3 containers · 50 pages	pass · 0% skew	8.7s
T08	correctness	Deduplication · triplicate seed	pass	66.6s
T09	correctness	Retry on failure · handler throws	pass	67.0s
T10	resilience	API /config returns 500	pass	3.1s
T11	resilience	Non-existent Docker image	pass	2.5s
T12	resilience	Distributed · missing CRAWLER_REDIS_URL	pass	2.1s
T13	resilience	containerCount precedence	pass	7.1s

13/13 passed — 321.8s total. See E2E-TEST-REPORT.md for full metrics including per-worker URL counts, deduplication proof, and retry traces.

Building & Publishing

npm run build          # compile src/ → dist/
npm publish --access public

The published package exports dist/cli.js as the dory-worker binary.
