@nikx/dory-worker

Licence: ISC · Version: 1.1.8 · Deps: 2 · Size: 226 kB · Vulns: 0 · Weekly: 2.9K

dory-worker

BullMQ job consumer for the Dory web scraping platform. Runs on any machine with Docker — including a Raspberry Pi. Pulls scraping jobs off a shared Redis queue and executes them by launching dory-core containers locally.

npm: @nikx/dory-worker@1.0.3


Architecture

dory-api (Railway)
  └─ enqueues job → BullMQ (Redis)
       └─ dory-worker (your home machine / Pi)
            │  GET /api/runs/:id/config
            │  POST /api/runs/:id/status  (running / completed / failed)
            │
            ├─ Single-container mode  (containerCount = 1)
            │    └─ docker run dory-core:v2
            │         └─ Crawlee in-memory queue
            │
            └─ Distributed mode  (containerCount > 1)
                 ├─ docker run dory-core:v2 × N
                 │    ├─ REDIS_URL=redis://host.docker.internal:6379
                 │    ├─ QUEUE_NAME=<runId>   ← job-scoped, isolated
                 │    ├─ WORKER_ID=worker-1..N
                 │    └─ IDLE_TIMEOUT_SECS=60
                 │
                 └─ Shared Redis queue  (rq:<queueId>:*)
                      ├─ :meta      queue metadata
                      ├─ :requests  all URLs ever added (Hash)
                      ├─ :ordering  Lua-locked sorted set
                      └─ :handled   completed requestIds (Set)

Multi-Worker Coordination (Helper-Join Model)

When multiple workers are running, they automatically coordinate to share work on distributed jobs:

Worker-1 picks up job from BullMQ
  └─ Becomes coordinator → spawns container(s)
  └─ POST /api/runs/:id/status { status: "running", coordinatorWorkerId: "worker-1" }

Worker-2, Worker-3 (heartbeat loop every 10s)
  └─ GET /api/runs/active/distributed → discovers active run
  └─ POST /api/runs/:id/join → becomes helper participant
  └─ Spawns own container(s) sharing the same Redis crawl queue
  └─ On exit → POST /api/runs/:id/leave

Key behaviors:

  • Workers discover active distributed runs via heartbeat polling (every 10 seconds)
  • Each worker can participate in multiple concurrent runs (limited by MAX_CONCURRENT_RUNS)
  • Helpers exit independently when the shared queue drains (idle timeout)
  • Coordinator waits for all helpers to leave before reporting final status
  • Schedule deduplication: if a scheduled job triggers while a previous run is still queued/running, the new trigger is skipped
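
The helper-join decision described above can be sketched as a pure function. This is an illustration only: the `ActiveRun` shape and the `pickRunsToJoin` name are assumptions, since the response format of GET /api/runs/active/distributed is not documented here. It only shows how a worker might respect MAX_CONCURRENT_RUNS when choosing runs to join.

```typescript
// Hypothetical shape of one entry returned by GET /api/runs/active/distributed.
interface ActiveRun {
  runId: string;
  coordinatorWorkerId: string;
}

// Decide which active distributed runs this worker could join as a helper.
// `current` holds the run IDs the worker already participates in.
function pickRunsToJoin(
  active: ActiveRun[],
  current: Set<string>,
  workerId: string,
  maxConcurrentRuns: number,
): string[] {
  const freeSlots = Math.max(0, maxConcurrentRuns - current.size);
  return active
    .filter((run) => run.coordinatorWorkerId !== workerId) // runs we coordinate need no join
    .filter((run) => !current.has(run.runId)) // already participating
    .slice(0, freeSlots) // never exceed the concurrency limit
    .map((run) => run.runId);
}
```

Each returned run ID would then be followed by POST /api/runs/:id/join, and by POST /api/runs/:id/leave once the helper's containers exit.
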

Distributed queue internals

Concern         Mechanism
Deduplication   SHA-256(uniqueKey).slice(0,15) → requestId; HGET :requests guard before any write
Atomic locking  Lua script LUA_LIST_AND_LOCK; ZADD score = ±lockExpiresAt; no two containers claim the same URL
Retry           Crawlee increments retryCount, re-enqueues until maxRequestRetries; exhausted → SADD :handled + errorMessages
Idle shutdown   IDLE_TIMEOUT_SECS = min(15, actorTimeoutSecs / 2); containers exit cleanly when the queue drains
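
To make the deduplication row concrete, here is a minimal sketch of the requestId derivation and the "check before write" guard. It rests on assumptions: the digest encoding (hex) is not stated in this README, and the `RequestsGuard` class is a hypothetical in-memory stand-in for the Redis :requests hash, not the actual implementation.

```typescript
import { createHash } from "node:crypto";

// Derive a short, deterministic requestId from a request's uniqueKey.
// Assumption: hex encoding. The README only specifies SHA-256 and the 15-character slice.
function requestIdFor(uniqueKey: string): string {
  return createHash("sha256").update(uniqueKey).digest("hex").slice(0, 15);
}

// Hypothetical in-memory stand-in for the Redis :requests hash.
class RequestsGuard {
  private readonly requests = new Map<string, string>();

  // Returns true if the request was newly stored, false if it was a duplicate.
  addIfNew(uniqueKey: string, payload: string): boolean {
    const id = requestIdFor(uniqueKey);
    if (this.requests.has(id)) return false; // mirrors the HGET guard before any write
    this.requests.set(id, payload);
    return true;
  }

  get size(): number {
    return this.requests.size;
  }
}

// Triplicate seed URLs collapse into a single stored request.
const guard = new RequestsGuard();
for (const url of ["https://example.com/", "https://example.com/", "https://example.com/"]) {
  guard.addIfNew(url, JSON.stringify({ url }));
}
console.log(guard.size); // 1
```

Because the ID depends only on uniqueKey, every container computes the same requestId for the same URL, which is what lets the guard work across containers.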

Prerequisites

  • Node.js ≥ 20.6
  • Docker (with access to dory-core:v2 image — build locally or pull from registry)
  • Redis (local container or remote — same instance used by dory-api)

Quick Start

Install Node.js ≥ 20.6, install the worker globally, log in, and run the interactive setup. Setup installs Docker, writes .env, pre-pulls the image, runs health checks, and starts the worker as a systemd/launchd service:

npm install -g @nikx/dory-worker
dory-worker login        # authenticate to the API (required before setup)
dory-worker setup

dory-worker login asks for your dory-api URL and your Dory username/password, then mints and stores a per-worker API key. setup (and start) refuse to run until you've logged in. See Authentication below.

dory-worker is a full management CLI — run it with no subcommand to start the worker in the foreground, or use any of:

dory-worker login        # authenticate to the API and store a worker key
dory-worker logout       # remove the stored worker key
dory-worker whoami       # show the current login
dory-worker setup        # full one-command install + configure + start service
dory-worker configure    # (re)write the .env interactively
dory-worker start        # run in the foreground (Ctrl+C to stop)
dory-worker start-bg     # install + start the background service
dory-worker status       # service status
dory-worker logs         # tail live logs
dory-worker health       # check Docker / API / Redis connectivity
dory-worker update       # update to the latest published version
dory-worker restart      # restart after a config change
dory-worker stop         # stop the background service
dory-worker uninstall    # remove the background service

Authentication

The API requires every worker to authenticate with a per-worker API key (dwk_…). The worker sends it as an X-API-Key header on all API calls, and passes it to the scraping containers it spawns so their result uploads are authenticated too. Without a key, the worker can't fetch config, heartbeat, or join runs — setup/configure/start/start-bg refuse to run until you have one.

Getting a key — two ways

1. dory-worker login (typical). You're on the worker machine and have a Dory login:

dory-worker login
# Dory API URL [http://localhost:4500]: https://your-api.example.com
# Username: alice
# Password: ********
# Worker label (optional): prod-eu-1

This verifies your credentials, mints a key, and saves it (with the API URL and label) to ~/.dory/credentials (mode 0600). Then run dory-worker setup.

Non-interactive (CI/scripts):

dory-worker login --api https://your-api.example.com -u alice -p "$DORY_PASSWORD" --label ci-runner
# or set DORY_USERNAME / DORY_PASSWORD in the environment

2. Admin-generated key. An admin generates a key in the UI (Worker Keys → Generate key) and gives you the dwk_… string. You don't need a Dory login — set it as an env var instead of running login:

export WORKER_API_KEY=dwk_...
dory-worker setup

WORKER_API_KEY takes precedence over ~/.dory/credentials, which is handy for containers and unattended deploys.
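
The precedence rule can be sketched as follows. This is a hedged illustration rather than the worker's actual code: the `StoredCredentials` shape and both function names are assumptions (the on-disk format of ~/.dory/credentials is not documented here). Only the env-over-file ordering and the X-API-Key header come from this README.

```typescript
// Hypothetical shape of the data saved in ~/.dory/credentials.
interface StoredCredentials {
  apiUrl: string;
  apiKey: string;
  label?: string;
}

// Resolve the worker key: WORKER_API_KEY wins, stored credentials are the fallback.
function resolveWorkerKey(
  env: Record<string, string | undefined>,
  stored: StoredCredentials | null,
): string {
  const fromEnv = env["WORKER_API_KEY"]?.trim();
  if (fromEnv) return fromEnv;
  if (stored?.apiKey) return stored.apiKey;
  throw new Error("No worker key: run `dory-worker login` or set WORKER_API_KEY");
}

// Header attached to every API call.
function authHeaders(key: string): Record<string, string> {
  return { "X-API-Key": key };
}
```

Passing the environment and the parsed credentials in as arguments keeps the rule easy to test without touching the real home directory.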

Managing keys

  • dory-worker whoami — show the current login (user, API, label)
  • dory-worker logout — remove the stored key from ~/.dory/credentials
  • Admins can list and revoke keys in the UI (Worker Keys page). Revoking a key cuts the worker off immediately; it must log in again (or get a new key) to reconnect.

Manual setup (advanced)

If you'd rather wire things up yourself instead of using dory-worker setup:

1. Install globally

npm install -g @nikx/dory-worker

This gives you the dory-worker CLI command.

2. Pull the scraper image

docker pull bynikx/dory-core:v2

3. Create a .env file

Create a .env file in the directory where you'll run the worker:

mkdir -p ~/dory-worker && cd ~/dory-worker

cat > .env << 'EOF'
# Only BOOTSTRAP keys go here. Operational config (Docker image, concurrency,
# container count, crawler Redis, log level) is served by the API and applied
# at startup / hot-reloaded — manage it in the dashboard, not here. Results
# upload through the API, so workers need no GCS bucket or GCP credentials.

# ─── Required ──────────────────────────────────────────────────────────
# Public URL of your dory-api instance. MUST match the API URL you used with
# `dory-worker login` — the worker key is tied to that API, so a mismatch means
# authenticating to one API while talking to another. (The guided `setup`/
# `configure` flow sets this automatically from your login; you only set it by
# hand on this manual path.)
API_BASE_URL=https://your-api.railway.app

# ─── Redis (same instance used by dory-api) ───────────────────────────
# Option A: full URL (recommended)
REDIS_URL=redis://default:password@your-redis-host:6379

# Option B: host + port (uncomment these instead of REDIS_URL)
# REDIS_HOST=your-redis-host
# REDIS_PORT=6379
# REDIS_PASSWORD=your-password

# BullMQ queue name — must match the API's QUEUE_RUN_EXECUTION
QUEUE_RUN_EXECUTION=run-execution

# Optional: friendly name for this worker (defaults to dory-worker-{pid}).
# Also used to target this node with a per-worker config override in the API.
# WORKER_ID=pi-worker-01

# ─── Auth ──────────────────────────────────────────────────────────────
# Per-worker API key. Normally you DON'T set this here — `dory-worker login`
# stores it in ~/.dory/credentials. Set it explicitly only for unattended
# deploys (containers/CI) or when an admin generated a key for you in the UI.
# It takes precedence over ~/.dory/credentials.
# WORKER_API_KEY=dwk_...

# ─── Optional: pin this node to local config (ignore API config) ──────
# Uncomment CONFIG_SOURCE and add operational keys to override the API here.
# CONFIG_SOURCE=local
# DOCKER_IMAGE=bynikx/dory-core:v2
# MAX_CONCURRENT_RUNS=2
# CRAWLER_REDIS_URL=redis://host.docker.internal:6379
EOF

Tip: Replace API_BASE_URL and REDIS_URL with your actual values. The Redis instance must be the same one your dory-api connects to.
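
As a rough sketch of how the two Redis options might be turned into connection settings: the `RedisTarget` type and `redisTargetFromEnv` name are hypothetical, and the choice to let REDIS_URL win when both options are set is an assumption. The README only says to use one or the other. The localhost and 6379 defaults are the ones listed under Environment Variables.

```typescript
// Hypothetical connection settings derived from the .env keys above.
interface RedisTarget {
  host: string;
  port: number;
  username: string | undefined;
  password: string | undefined;
}

function redisTargetFromEnv(env: Record<string, string | undefined>): RedisTarget {
  const redisUrl = env["REDIS_URL"];
  if (redisUrl) {
    // Option A: full URL (assumed here to take priority if both options are set).
    const url = new URL(redisUrl);
    return {
      host: url.hostname,
      port: url.port ? Number(url.port) : 6379,
      username: url.username ? decodeURIComponent(url.username) : undefined,
      password: url.password ? decodeURIComponent(url.password) : undefined,
    };
  }
  // Option B: host + port, with the documented defaults.
  return {
    host: env["REDIS_HOST"] ?? "localhost",
    port: Number(env["REDIS_PORT"] ?? 6379),
    username: undefined,
    password: env["REDIS_PASSWORD"],
  };
}
```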

4. Authenticate

Get a worker key before starting (see Authentication):

dory-worker login        # or set WORKER_API_KEY in .env from an admin-issued key

5. Start the worker

cd ~/dory-worker
dory-worker

The worker will connect to Redis, start polling for jobs, and send heartbeats to the API every 10 seconds. You should see it appear on the Workers page in the Dory UI.

6. Run on boot (optional)

To keep the worker running after reboots, create a systemd service (Linux / Raspberry Pi):

sudo tee /etc/systemd/system/dory-worker.service > /dev/null << EOF
[Unit]
Description=Dory Worker
After=network.target docker.service
Requires=docker.service

[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/dory-worker
ExecStart=$(which dory-worker)
EnvironmentFile=$HOME/dory-worker/.env
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now dory-worker

Check status: sudo systemctl status dory-worker

On macOS, use a Launch Agent or just run dory-worker in a terminal / tmux session.

7. Multiple workers on the same machine (optional)

You can run multiple worker processes on the same device. Unless you set WORKER_ID, each gets a unique ID based on its PID, and BullMQ ensures no two workers pick up the same job.

# Terminal 1
WORKER_ID=worker-a dory-worker

# Terminal 2
WORKER_ID=worker-b MAX_CONCURRENT_RUNS=1 dory-worker

Or give each instance its own directory and .env file:

cd ~/dory-worker
dory-worker                          # uses .env (default)

cd ~/dory-worker-2
dory-worker                          # uses its own .env

All instances share the same Docker daemon and Redis queue — useful for splitting MAX_CONCURRENT_RUNS across processes or isolating different queue configurations.


Environment Variables

Variable             Required     Default            Description
API_BASE_URL         yes          -                  dory-api URL (reachable from Docker containers)
REDIS_URL            one of       -                  Full Redis URL
REDIS_HOST           one of       localhost          Redis hostname
REDIS_PORT           -            6379               Redis port
REDIS_PASSWORD       -            -                  Redis password
CRAWLER_REDIS_URL    distributed  -                  Redis for the per-run crawler queue
CONTAINER_COUNT      -            1                  Containers per job (overridden by dory-api per actor)
MAX_CONCURRENT_RUNS  -            2                  Parallel BullMQ jobs
WORKER_ID            -            dory-worker-{pid}  Label shown in logs
LOG_LEVEL            -            info               debug | info | warn | error
DOCKER_IMAGE         -            -                  Fallback image if API doesn't return one
QUEUE_RUN_EXECUTION  -            run-execution      BullMQ queue name for scraping jobs; must match the API's value

GCS: workers and containers no longer use GCS. Results upload through dory-api, which is the only component holding the bucket and credentials — set GCS_BUCKET / GCP_PROJECT_ID / credentials on the API, not here.


How a Job Flows

  1. dory-api enqueues a BullMQ job { runId } onto the run-execution queue (configurable via QUEUE_RUN_EXECUTION).
  2. Worker picks up the job — calls GET /api/runs/:id/config to get actorConfig, dockerImage, containerCount, memoryLimitMb, actorTimeoutSecs.
  3. Worker calls POST /api/runs/:id/status { status: "running" }.
  4. Worker calls docker run (once for single-container, N times for distributed). Each container receives:
    • ACTOR_CONFIG — base64-encoded actor/user-input JSON
    • API_BASE_URL — so the container can POST status callbacks
    • CRAWLEE_MEMORY_MBYTES — from memoryLimitMb
    • (distributed only) REDIS_URL, QUEUE_NAME, WORKER_ID, IDLE_TIMEOUT_SECS
  5. Worker extends the BullMQ lock every 2 minutes while containers run.
  6. Worker calls docker wait on all containers in parallel and uses the worst exit code among them.
  7. Worker calls POST /api/runs/:id/status { status: "completed"|"failed", exitCode } as a fallback, only if no status callback from the containers has arrived.
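
Step 4 can be sketched as a function that assembles the per-container environment. Treat it as an illustration under assumptions: the function names and the fields of `RunConfig` and `DistributedOpts` are hypothetical (only the type name DistributedOpts appears in the source layout), while the environment variable names are the ones listed above.

```typescript
// Hypothetical subset of the /config response used for container env.
interface RunConfig {
  actorConfig: unknown;
  memoryLimitMb: number;
}

// Hypothetical fields; only the environment variable names come from the README.
interface DistributedOpts {
  redisUrl: string;
  queueName: string;
  workerId: string;
  idleTimeoutSecs: number;
}

function buildContainerEnv(
  apiBaseUrl: string,
  cfg: RunConfig,
  dist?: DistributedOpts,
): Record<string, string> {
  const env: Record<string, string> = {
    // Base64-encoded actor/user-input JSON.
    ACTOR_CONFIG: Buffer.from(JSON.stringify(cfg.actorConfig), "utf8").toString("base64"),
    API_BASE_URL: apiBaseUrl,
    CRAWLEE_MEMORY_MBYTES: String(cfg.memoryLimitMb),
  };
  if (dist) {
    // Distributed mode only.
    env["REDIS_URL"] = dist.redisUrl;
    env["QUEUE_NAME"] = dist.queueName;
    env["WORKER_ID"] = dist.workerId;
    env["IDLE_TIMEOUT_SECS"] = String(dist.idleTimeoutSecs);
  }
  return env;
}

// Flatten into `docker run` arguments: -e KEY=value -e KEY=value ...
function toDockerEnvArgs(env: Record<string, string>): string[] {
  return Object.entries(env).flatMap(([key, value]) => ["-e", `${key}=${value}`]);
}
```

Base64 keeps the JSON intact on the command line, so the container can decode ACTOR_CONFIG without worrying about shell quoting.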

containerCount precedence: dory-api /config response > CONTAINER_COUNT env var > default 1.
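
A minimal sketch of that precedence chain, assuming a hypothetical `resolveContainerCount` helper. Treating missing or invalid values as "not set" is my assumption, not documented behavior.

```typescript
// Precedence: dory-api /config response > CONTAINER_COUNT env var > default 1.
function resolveContainerCount(
  apiValue: number | undefined,
  envValue: string | undefined,
): number {
  if (apiValue !== undefined && Number.isInteger(apiValue) && apiValue >= 1) {
    return apiValue;
  }
  const parsed = Number.parseInt(envValue ?? "", 10);
  if (Number.isInteger(parsed) && parsed >= 1) {
    return parsed;
  }
  return 1;
}

console.log(resolveContainerCount(3, "2")); // 3 (API wins)
console.log(resolveContainerCount(undefined, "2")); // 2 (env var)
console.log(resolveContainerCount(undefined, undefined)); // 1 (default)
```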

localhost rewriting: API_BASE_URL and CRAWLER_REDIS_URL containing localhost are automatically rewritten to host.docker.internal before being injected into containers.
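
The rewrite could look roughly like this. It is a sketch under assumptions: the `rewriteLocalhost` name and the regular expression are illustrative, and it deliberately touches only the literal host localhost, since the README does not say whether 127.0.0.1 is also rewritten.

```typescript
// Replace a `localhost` host with `host.docker.internal` so the URL
// resolves to the Docker host from inside a container.
function rewriteLocalhost(url: string): string {
  // Match scheme://[userinfo@]localhost, followed by a port, path, query, fragment, or end.
  const pattern = /^([a-z][a-z0-9+.-]*:\/\/(?:[^\/@]*@)?)localhost(?=[:\/?#]|$)/i;
  return url.replace(pattern, "$1host.docker.internal");
}

console.log(rewriteLocalhost("redis://localhost:6379"));
// redis://host.docker.internal:6379
console.log(rewriteLocalhost("https://your-api.railway.app"));
// https://your-api.railway.app (unchanged)
```

Inside a container, localhost refers to the container itself, which is why a URL that works on the host has to be rewritten before it is injected.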


Source Layout

src/
  cli.ts             Entry point — loads config, starts BullMQ Worker
  config.ts          WorkerConfig interface + loadConfig() from env vars
  worker.ts          BullMQ Worker setup, concurrency, graceful shutdown
  processor.ts       Core job handler — fetch config, spawn containers, wait
  docker.ts          docker run / docker wait wrappers; DistributedOpts
  logger.ts          Structured logger with log levels

test/
  harness.ts         Standalone test harness — mock API + real worker + Redis inspection
  redis-inspector.ts Post-run queue inspector — reads rq:* keys, returns metrics

scripts/
  run-all-tests.ts   13-scenario E2E suite runner → writes E2E-TEST-REPORT.md

test-image/
  Dockerfile         Minimal test image used by the harness in CI

Testing

Run a single scenario

# Minimal (single container, empty handlers — validates worker lifecycle)
npm test

# Real cheerio crawl (quotes.toscrape.com, 10 pages)
SCENARIO=real-crawl npm test

# Distributed mode (2 containers)
npm run test:distributed

# Distributed, 3 containers, 50 pages
DISTRIBUTED=true CONTAINER_COUNT=3 SCENARIO=dist-large npm test

# Deduplication — triplicate seed URLs
DISTRIBUTED=true SCENARIO=dedup npm test

# Retry on failure — handler throws on page 2
DISTRIBUTED=true SCENARIO=retry-failure npm test

# API failure resilience
SCENARIO=api-error EXPECT_FAILURE=true npm test

Valid SCENARIO values: minimal, real-crawl, large-crawl, distributed, dist-large, dedup, retry-failure, api-error, missing-redis.

Run the full 13-scenario suite

npm run test:all

Results are written to E2E-TEST-REPORT.md.

E2E test results (v1.0.3)

#    Category      Scenario                                 Result            Duration
T01  happy-path    Single-container · minimal               pass              6.1s
T02  happy-path    Single-container · 10-page crawl         pass              6.8s
T03  happy-path    Single-container · 50-page crawl         pass              7.3s
T04  distribution  Distributed · 2 containers · 10 pages    pass (0% skew)    67.3s
T05  distribution  Distributed · 3 containers · 10 pages    pass (10% skew)   68.1s
T06  distribution  Distributed · 2 containers · 50 pages    pass (0% skew)    9.1s
T07  distribution  Distributed · 3 containers · 50 pages    pass (0% skew)    8.7s
T08  correctness   Deduplication · triplicate seed          pass              66.6s
T09  correctness   Retry on failure · handler throws        pass              67.0s
T10  resilience    API /config returns 500                  pass              3.1s
T11  resilience    Non-existent Docker image                pass              2.5s
T12  resilience    Distributed · missing CRAWLER_REDIS_URL  pass              2.1s
T13  resilience    containerCount precedence                pass              7.1s

13/13 passed — 321.8s total. See E2E-TEST-REPORT.md for full metrics including per-worker URL counts, deduplication proof, and retry traces.


Building & Publishing

npm run build          # compile src/ → dist/
npm publish --access public

The published package exports dist/cli.js as the dory-worker binary.
