0.2.0 • Published 19h ago

@perhapxin/dddk

Licence

AGPL-3.0-or-later

Version

0.2.0

Deps

Size

11.8 MB

Vulns

Weekly

Summary Dependency Versions

dotdotduck

Turn your existing site into an AI-native site.

An embedded AI SDK that lives inside your page and operates the DOM — not a chatbot bolted to the corner.

繁體中文 →

https://github.com/user-attachments/assets/18d797df-4952-421a-a2b3-16aef1ebcb34

01 · Command palette — every feature, behind one panel

Cmd+K palette open: /introduce, /theme, /language, /immersive_translate, #find-on-page, docs: search, Go to entries — all in one list

Ctrl/⌘+K opens it. Your commands sit alongside Ask AI in one list — switch theme, change language, find a customer, all from one place.
Inline mount or modal — same items. palette.mountInline(host) embeds the palette persistently inside a sidebar / drawer / dialog. Ctrl/⌘+K still raises the full modal on top; close restores the inline.
Rich rows. Items support lines: string[] (multi-line metadata) and image (thumbnail URL). Book covers, product shots, customer avatars without a custom renderer.
Prefix routing — /command, @entity, order:, #tag. One convenient entry point for everything the user is stuck on.
Layered customisation — CSS-variable theming, Skill SDK (Script / Prompt / Action / Surface / Panel), or wire existing host features in as palette items.
Zero built-in commands. What shows up is entirely yours. The SDK ships infrastructure; you ship the vocabulary.

02 · WebAgent — operates the page, not a sidebar chatbot

agent narrating its next step in the subtitle bar with space-continue / double-tap-exit / esc-cancel hint and confirm buttons

DOM-grounded autonomous loop. Reads the visible page, picks one tool at a time, narrates each step in the subtitle bar before it runs.
Opt-in action bundles. Default install is coreActions (5: narrate · navigate · click · border · scroll_to). Opt in to formActions (input / drag / hold_key / double_click / long_press), flowActions (wait / pause / ask_user), or extraActions (highlight / track_intent / escalate_to_human). Add your own; the LLM picks them.
Cursor on every action. With cursorTrail: true, a synthetic cursor glides onto each target before it fires — click / fill_input / border / scroll_to / narrate-with-about. Pre-action pause, arrival pulse, reduced-motion fallback.
Space-gated every step. Single tap accept · double-tap reject · Esc cancel. Users see what's about to happen before it happens.
Asks back when ambiguous. ask_user_choice for 2-4 options, ask_user for free text. No silent decisions.
Bring your own keys. LLM via OpenAI, Google AI Studio, or a server-side ProxyProvider. Per-role routing keeps cheap models on cleanup and the flagship on the agent loop. STT defaults to the browser's Web Speech; swap via transcribe(audio).

03 · Inline Agent — select text, AI without leaving the input

floating Edit with AI menu next to a textarea selection: Translate / Improve writing / Fix spelling & grammar / Make shorter / Make longer / Change to professional tone / Explain this

Highlight any text in any <input> / <textarea> / [contenteditable] — a floating toolbar appears below the selection. Pick an action, the result streams back in place of the selection.
Default action set out of the box — Translate, Improve writing, Fix spelling & grammar, Make shorter, Make longer, Change to professional tone, Explain this. Drop the defaults, add your own (/translate-with-glossary, /rewrite-as-email).
Two-column layout option for editor hosts that want a Format column next to an AI column. Optional keyboard shortcuts (e.g. Ctrl+Shift+R to rewrite without opening the menu).

04 · Direct manipulation — gestures you already know

Several physical entry points to send context into dddk. No new vocabulary to learn.

	A · Hold Space — voice in Focus inside an input → fills the input. Anywhere else → goes to the agent. Optional LLM cleanup pass — fillers + punctuation in one shot. STT swappable — defaults to the browser's Web Speech (no SLA, Firefox unsupported). One `VoiceConfig.transcribe` callback swaps in Whisper or any vendor.
	B · Long-press anything — Dwell Long-press any element for ~1s → frame pins around it. Next `Ctrl+K` opens the palette with that element as context. Visual elements (charts, images) ship an auto-screenshot.
	C · Drag a screenshot Click the camera in the palette → drag a rectangle on the page. Captured region attaches to your next Ask AI / agent question. Charts, dashboards, maps — show the AI exactly what you mean.
	D · `/introduce` — guided tour Declarative tours — list of `page` + `subtitle` + `action(tools)` steps. Space advances · Esc / double-Space exits. User reads at their pace. Write your onboarding / feature tour once · replay any time (palette command, proactive prompt, or programmatic).

05 · Mobile — FAB + your own buttons

mobile portrait view: dotdotduck FAB at bottom-right corner, 'Listening — release to send' indicator shown above the FAB while voice is active

Floating action button. Ships out of the box on mobile breakpoints — tap to open the palette, long-press to hold-talk into the agent.
Touch gestures. The same Space-hold / long-press / multi-choice patterns work via touch — tap → palette, long-press → voice, digit keys → option pick.
Your own button. Replace the duck FAB with any host element. Pass a selector or HTMLElement and dddk wires the open / hold-talk handlers onto it — the FAB lives wherever your design wants it (header bar, side rail, brand asset, etc.).
Responsive chrome. Subtitle bar auto-offsets above the on-screen keyboard; palette becomes full-width below 640px; tap targets respect the 44×44 touch-spec.

06 · Proactive — read the signal, ask the right question

subtitle bar showing yes/no prompt 'Your Monday order just shipped — want me to pull the tracking?' and multi-choice 'How should I handle this return?' with options

The agent subscribes to page signals — scroll depth, Dwell time, time-on-page, last interaction — and surfaces an offer in the subtitle bar when conditions match.
Yes / no resolves with Space. Single tap accepts, double-tap rejects. No popup chrome, no layout shift — the whole exchange stays in the subtitle bar.
Multi-choice with 1-9 number keys plus a trailing Other slot that always accepts free text. The user gets out without typing if your options covered it; if not, they can still answer.
Every response emits a typed intent (proactive_accepted / agent_choice with the picked value) so you measure what lands. Not big-data fishing — direct asking and recorded answers.
Customer-service plays out of the box — order just shipped → "Want me to pull the tracking?"; user lingers on the returns page → list three common actions.

07 · Intent stream — every yes / no is a signal, the dashboard writes itself

dotdotduck dashboard: 3 sessions, 2 visitors, 577 events, 118 palette opens, geography panel, top palette items table

dotdotduck dashboard — LLM streaming perf tile (avg TTFT, tok/s, duration) + by-model breakdown table + agent runs summary

Every interaction emits a typed event. Palette opens, voice transcripts, agent answers, accept / reject gestures, Dwell selections, multi-choice picks, even per-LLM-call streaming perf — all flow through one structured stream.
Event types include palette_activated · voice_captured · agent_asked / agent_answered (with latencyMs) · agent_run_started / completed / stopped · agent_pause_decision · agent_llm_call (TTFT, tokens/sec, model) · confirm_action · selection_used · skill_started / finished · agent_feedback. Add your own and they ride the same channel.
Clean behavioural data, no big-data fishing. You learn what users want from what they actually asked + answered, not from inferring through clickstreams.
Bundled dashboard route turns the stream into charts out of the box — yes-rate over time, TTFT / tok-per-sec by model, agent-run completion rate, top palette items, geography — or subscribe in code and pipe to Mixpanel / Amplitude / your own BI.

Why adopt — eight concrete plays

Most customer-service tickets are page-solvable. "How do I X" / "where do I Y" / "track my order" / "change my plan" — the answers all live on your site already; the gap is discoverability. A DOM-grounded agent that operates the page closes that gap. Deflect the easy 70% before they reach a human queue.
Proactive offers convert. Watching scroll · Dwell · time-on-page · last interaction lets the agent ask "want me to pull the tracking?" / "want a recommendation based on what you're looking at?" before the user thinks to ask. Subtitle-bar yes/no resolves in one keystroke — friction is the lowest physically possible. Same surface for cross-sell and upsell plays.
The palette is a UI surface, not just a text list. Each row's detail pane (and PanelSkills inside the palette) can render any Pieces tree — charts, tables, forms, mini-dashboards. That makes the palette a real productivity surface, not just a launcher:
- Finance — AAPL in the palette pulls a live price card + sparkline alongside the row.
- Customer service — type a question; the palette shows the matching FAQ entry with formatted answer inline, not a link to click.
- Tool-type SaaS — pack utilities (regex tester, JSON formatter, unit converter, internal lookup) straight into the palette so users never tab out. Same Ctrl+K, different verbs per product.
Long-press beats "screenshot + describe". With Dwell, the user holds an element, the agent gets selector + auto-screenshot in one gesture — chart, dashboard panel, table row, whatever. Users stop interrupting themselves to take a screenshot, paste it into chat, and write a paragraph explaining what they meant. Intent flows straight from finger to LLM.
Break the language wall with one palette command. Built-in immersive translate renders every paragraph of the current page bilingually side by side — one keystroke turns your English-only docs / knowledge base / product copy into a Chinese / Japanese / Korean / Spanish-readable surface. Batched into a handful of LLM calls per page (a 200-paragraph article costs ~7 calls). For cross-border SaaS, content platforms, or any product serving multiple regions, that's one fewer translation-engineering project on the roadmap.
One SDK instead of stitching six vendors. Palette + agent + inline AI + voice + Dwell + proactive + analytics + immersive translate ship as one install. The conventional alternative is Algolia for search, Intercom for chat, Mixpanel for analytics, Whisper for voice, plus the brittle glue code between them. dddk is one dependency, one theme system, one intent stream.
Yes / no / multi-choice = free RL labels. Every Space-accept and double-Space-reject is a clean, intentional signal — what the user actually wanted vs didn't, said by the user, recorded with the original prompt. No more inferring from clickstream noise. The training set for whatever you fine-tune or evaluate next is already collected.
Voice doesn't stop at the browser. The same Voice + utility LLM shape powers IoT panels, kiosk terminals, service machines, and accessibility-first surfaces for elderly users or anyone who'd rather not type. One mental model across every device that has a microphone.

v0.2.0 — what shipped

Architectural rework of the webagent core. One breaking change (coreActions is the default install, not all 12 builtin actions). Full notes: release-notes.md.

Cost validation — done. gpt-5.4-nano runs the full monolithic webagent loop with the same task-success rate as gpt-5.4-mini at roughly an order of magnitude lower cost. That's the new default for webagent + plan roles on dddk.perhapxin.com.

Highlights:

TaskAgent — third agent kind alongside WebAgent + InlineAgent. Conversation + host-defined tool calling, no DOM, plain protocol. ask() / streamAsk(). Same AgentSession shape so multiple TaskAgents share conversation history when wired to the same session.
WebAgent multi-instance + shared sessions — dddk.sessions named-session registry + dddk.agents named-instance registry. Inject the same AgentSession into different WebAgents (one persona per route) and dddk.agents.setActive(name) on route change.
Opt-in action bundles — default install is coreActions (5: narrate / navigate / click / border / scroll_to). Pass formActions / flowActions / extraActions to opt in. builtinActions kept as union for back-compat. (Breaking change.)
New actions — hold_key, double_click, long_press, drag, press_key extended with modifiers. narrate promoted from CoT-only primitive to first-class action in the registry.
Cursor on every action — cursorTrail: true now covers click / border / highlight / fill_input / scroll_to / narrate-with-about. scroll_to swaps cursor glyph to a mouse-wheel icon mid-scroll. New API: moveCursorTo(el), cursorPulse(), setCursorMode('pointer' | 'scroll' | 'reading').
Planner sees the DOM — every planning call now receives a current-page snapshot in hostContext, so the planner can spot routes / links visible on the page even when the briefed sitemap missed them. Cap via plannerDomMaxLength (default 8000).
Navigate path validation — navigate rejects paths not in the sitemap and returns the valid path list to the LLM for retry. Stops the loop from chasing hallucinated paths into 404s.
Streaming envelope parser — scanner-based incremental JSON parser. Each action dispatches the moment its tool-args { } balances, instead of waiting for the outer envelope to close. Opt in via enableStreamingEnvelope: true.
Live registry — webagent.registerTool(def) → ToolHandle and webagent.registerContextProvider(role, fn) → ContextProviderHandle. Handle's remove() unregisters; context-provider remove restores the SDK default rather than emptying the slot.
Context providers split — six slots (url, page_summary, dom, screenshot, history, selection) with default providers SDK-installed in the WebAgent constructor.
InlineAgent scoping — inlineAgent.attachScope(selector, config) for per-region action sets. Innermost-wins on the selection's anchor element; callback fallback via setScopeResolver.
onLoopEnd hook — agent-loop closure UI: silent / text / feedback (Space accepts · double-tap rejects · Esc nulls) / ask_user (closing question with options).
agent_tool_failed intent event — emitted whenever a tool handler returns { ok: false } or throws.
Inline palette + rich rows — dddk.palette.mountInline(host, opts?) persistently embeds the palette inside a host element (no backdrop). Ctrl/⌘+K raises the modal on top, close restores the inline. New PaletteItem.lines: string[] + image: string + submitButton: boolean.
Self-hosted analytics layer (@perhapxin/dddk/analytics) — IndexedDB-backed EventStore + toCSV / toNDJSON / toSQL exporters + function-based SqlSchemaMapper. Canonical dddk_events DDL ships for SQLite / Postgres / MySQL.
Mini dashboard (@perhapxin/dddk/analytics/dashboard) — renderDashboard(container, store) mounts six vanilla-SVG charts. EN / zh-TW labels, optional auto-refresh.
Session-lifecycle hardening — hard reload (F5 / Ctrl+R / Ctrl+Shift+R) always clears session regardless of sessionContinuityMs; default sessionContinuityMs flipped from 5 * 60 * 1000 to 0 (each ask is its own session unless host opts in).
Subtitle click/tap = Space — single click on the subtitle surface accepts; double-click rejects. Mouse / touch / pen all work.

v0.3 roadmap

Items consciously deferred from v0.2:

Cross-type session sharing with full re-serialization — TaskAgent reading WebAgent's session already works (CoT agent_step turns are silently skipped); the reverse (WebAgent reading TaskAgent's plain-chat turns and re-wrapping them as CoT envelope shape) is more work.
Multi-agent delegation — a TaskAgent calling a WebAgent (or vice versa) via a tool. Workable; introduces orchestrator-routing complexity that wants real use-case validation first.
buildMessages migration through provider registry — url / page_summary / history / selection / screenshot are consulted via providers; dom is still inline because currentIndexMap for selector resolution is coupled to the call site. Untangling that is mechanical but careful refactor.
TaskAgent tool-args incremental streaming — streamAsk already streams text deltas and toolCallStart / toolCallEnd markers. Streaming the tool arguments AS the LLM types them is on the roadmap.
Cross-tab session share for TaskAgent — WebAgent already crosstabs; TaskAgent doesn't yet.

v0.1.x bug fixes continue to ship on the v0.1.x branch.

Status — early stage, read before evaluating

dotdotduck is in active development. It works, but expect rough edges. A few things up front:

Clone the repo to evaluate properly. The bundled docs are useful as a map, but the source is the source of truth. git clone https://github.com/PerhapxinLab/dotdotduck into your project directory and read the code alongside the online docs — that's the recommended way to understand what's actually implemented.
The docs are AI-drafted. They're written and maintained with Claude Code. They stay close to the code by convention, but if something looks wrong, grep the repo before assuming the docs are right.
Found a bug or unclear behaviour? Open an issue at github.com/PerhapxinLab/dotdotduck/issues — one-liners help shape the roadmap.

What the live demo runs (not bundled with the package)

dddk.perhapxin.com doubles as dotdotduck's official landing page AND as the real-world test bed for the package — every release ships first to this site and gets exercised end-to-end before being tagged. The standing challenge: serve the demo well using the smallest viable model at each role, so the same recipe holds up when other teams adopt dddk on a cost budget. Expect the model picks below to keep shifting as smaller checkpoints catch up.

Current stack:

4-axis LLM router (webagent / vision / utility / plan) — host configures one model per role; the bundled demo runs OpenAI gpt-5.4-nano for the main agent loop and planner, gpt-5.4-mini for InlineAgent + voice cleanup.
Speech-to-text → the browser's Web Speech API (the SDK default; fine for demo, no SLA — production hosts wire transcribe with Whisper / Deepgram / etc.)

None of this is baked into @perhapxin/dddk. The package itself ships LLM provider adapters (OpenAI / Google / proxy, plus any OpenAI-compatible vendor via baseURL — e.g. DeepSeek, Qwen, OpenRouter) and a transcribe(audio) extension point. Bring your own keys, models, and ASR vendor — the SDK doesn't lock you in.

Documentation

What's new in v0.2.0 → release notes · migration guide
Full docs → dddk.perhapxin.com/docs
Agent (DOM-grounded loop + InlineAgent + sitemap + Memory) → /dddk/agent
LLM providers + router + adapter registry → /dddk/llm
Skills system + evals → /dddk/skills
Modules (voice / Dwell / inline / immersive translate / proactive / analytics) → /dddk/modules
Toolbox (search + recommend) → /dddk/toolbox
Theming → /dddk/theming

Install

pnpm add @perhapxin/dddk
# or: npm i @perhapxin/dddk

import { DotDotDuck, OpenAIProvider } from '@perhapxin/dddk';
import '@perhapxin/dddk/styles.css';

const dddk = new DotDotDuck({
  llm: new OpenAIProvider({
    apiKey: import.meta.env.VITE_OPENAI_KEY,
    model: 'gpt-5.4-mini',
  }),
  siteName: 'YourSaaS',
  skills: [
    {
      id: 'introduce',
      type: 'script',
      name: 'Tour the app',
      steps: [
        { subtitle: 'Welcome!', action: (t) => t.spotlight('.hero') },
        { subtitle: 'Here is pricing.', action: (t) => t.highlight('.pricing'), waitForUser: true },
      ],
    },
  ],
});

dddk.mount();

Press Ctrl/⌘+K, type /introduce, watch it run. The full quickstart guide covers React / Vue / Svelte / Solid wiring.

Theming

Everything visual reads from CSS custom properties — --dddk-bg, --dddk-accent, --dddk-radius, --dddk-font, and friends. Override at :root or scope inside any wrapper.

:root {
  --dddk-accent: #6366f1;       /* your brand colour */
  --dddk-radius: 10px;
  --dddk-font: 'Inter', system-ui, sans-serif;
}

Dark mode is automatic: [data-theme="dark"] anywhere up the tree, OR @media (prefers-color-scheme: dark) — whichever fires first. Custom modes (sepia, high-contrast, brand-specific) work by overriding the same variables under a new selector.

License

AGPL-3.0-or-later. See LICENSE for the full text.

Built by Perhapxin Team

Keywords

command-palette cmdk ai-agent browser-agent web-agent dom voice inline-ai dwell skill-sdk react svelte vue