webEmbedding is a source-first website cloning engine for AI coding agents: it captures live pages with Playwright, replays network evidence from HAR artifacts, rebuilds only when direct reuse is blocked, and self-verifies the result.
It ships as a Skill + MCP server. Instead of asking a model to "clone this site" from a screenshot, it inspects the URL, chooses a reuse or rebuild route, captures DOM/runtime HTML/styles/assets/network traces, generates bounded frontend reconstruction artifacts, and checks the output with visual, DOM, computed-style, interaction, and responsive-breakpoint verification.
GitHub listing, social preview, and launch-copy recommendations are in docs/github-listing.md.
The current pipeline is strongest for static and semi-static web pages:
- company, brand, marketing, and documentation pages
- public landing pages
- iframe-blocked pages that need capture-based reconstruction
- responsive page snapshots across desktop, tablet, and mobile
It is not a full backend or app-logic clone engine. Login-only screens, app-first or native-app-required services, captcha-heavy sites, maps, games, canvas/WebGL-heavy pages, real-time feeds, payments, booking flows, and private server behavior still need separate handling.
Operationally, the repo is now a production-candidate clone engine for URL-based capture and bounded reconstruction: jobs can be queued, network evidence can be replay-audited from HAR artifacts, authenticated dashboard runs can be driven from user-owned browser state, and local gates verify the route corpus, score checks, package contents, and CI wiring. The remaining hard boundary is server-side product behavior, not front-end evidence capture and reconstruction.
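To make the idea of replaying network evidence from a HAR artifact concrete, here is a minimal sketch. It is not the project's replay engine: it assumes a standard HAR 1.2 layout (`log.entries[]` with `request.method`, `request.url`, `response.status`, and `response.content.text`), and the names `build_replay_index` and `replay` are hypothetical.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]


def build_replay_index(har: dict) -> Dict[Key, dict]:
    """Index HAR entries by (method, url); the first recorded response wins."""
    index: Dict[Key, dict] = {}
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        key = (request.get("method", "GET").upper(), request.get("url", ""))
        index.setdefault(key, entry.get("response", {}))
    return index


def replay(index: Dict[Key, dict], method: str, url: str) -> Optional[dict]:
    """Return the recorded status and body, or None when no evidence exists."""
    response = index.get((method.upper(), url))
    if response is None:
        return None
    return {
        "status": response.get("status"),
        "body": response.get("content", {}).get("text", ""),
    }


# Hand-written sample data, not a real capture.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://example.com/"},
                "response": {"status": 200, "content": {"text": "<html></html>"}},
            }
        ]
    }
}

index = build_replay_index(sample_har)
print(replay(index, "get", "https://example.com/"))
print(replay(index, "GET", "https://example.com/missing"))
```

Keeping the first recorded response for each key makes the lookup deterministic; returning `None` for unrecorded requests keeps missing evidence visible instead of inventing a response.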
Recent local benchmark runs from this repo:
| URL | Path | Score |
|---|---|---|
| https://developer.mozilla.org/en-US/ | iframe-blocked bounded rebuild | root 94, visual 95, mobile 94, tablet 94, breakpoint average 94 |
| https://www.mozilla.org/ | bounded rebuild | root 94, visual 100 |
| https://www.python.org | harder bounded rebuild sample | root 90, visual 100 |
| https://www.example.com | exact reuse | ready yes |
These are generated by the local self-verify pipeline, not manually assigned ratings.
The reproducible commands and score thresholds are tracked in docs/benchmark-evidence.json.
Production readiness gates are tracked in docs/production-pipeline-gates.json.
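As an illustration of how a score gate over these numbers could work, the sketch below compares a set of scores against minimum thresholds. The scores are copied from the MDN benchmark row; the threshold values and the `failing_scores` helper are invented for this example and do not reflect the contents of docs/benchmark-evidence.json.

```python
from typing import Dict, List


def failing_scores(scores: Dict[str, int], thresholds: Dict[str, int]) -> List[str]:
    """Return the names of scores that are missing or below their minimum."""
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0) < minimum
    ]


# Scores from the MDN row of the benchmark table.
mdn_scores = {"root": 94, "visual": 95, "mobile": 94, "tablet": 94}

# Hypothetical thresholds, chosen only to demonstrate the check.
hypothetical_thresholds = {"root": 90, "visual": 90, "mobile": 85, "tablet": 85}

failures = failing_scores(mdn_scores, hypothetical_thresholds)
print("ready" if not failures else f"below threshold: {failures}")
```

Treating a missing score as 0 means an absent measurement fails the gate rather than passing silently.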
- Source-first routing:
  - direct iframe or embed reuse when it is safe and frameable
  - original preview, export, remix, or source routes when available
  - bounded rebuild only when exact reuse is unavailable
- Live browser capture:
  - DOM snapshot
  - runtime HTML
  - full-page screenshot
  - computed style summaries
  - CSS analysis
  - asset inventory
  - HAR-like network metadata
  - interaction states and replay traces
  - storage state export for session-aware flows
- Blocked-site rebuild:
  - handles X-Frame-Options and CSP-blocked pages by rebuilding from captured evidence
  - generates reusable frontend reconstruction artifacts from captured page structure
  - preserves custom tags, shadow-root host structure, and semantic document structure where captured
- Evidence limitation reporting:
  - separates directly captured artifacts from inferred or missing evidence in reproduction results and prompts
  - marks app-gated, auth-gated, and native-app-led surfaces as bounded evidence, with recommendations for user screenshots or authenticated session capture
- Operational failure classification:
  - reports typed pipeline action codes such as network-replay-limited, auth-session-missing, public-app-gate, and canvas-visual-fallback
  - exposes HAR/network replay_readiness before treating captured network evidence as replay-grade
- Production pipeline helpers:
  - filesystem-backed async clone job queue with durable JSON records, worker locks, retry scheduling, cancellation, and manifest annotation
  - deterministic HAR replay engine for standard HAR, near-HAR, and captured network/manifest.json artifacts
  - authenticated dashboard live corpus runner that accepts user-provided storage_state_path or user_data_dir outside the repo
- Self-verification:
  - screenshot similarity
  - DOM snapshot similarity
  - computed-style similarity
  - hover/focus/click interaction state parity
  - interaction trace parity
  - desktop/mobile/tablet breakpoint reports
- Responsive benchmark support:
  - primary desktop viewport: 1440x1200
  - tablet profile: 768x1024
  - mobile profile: 390x844
- Repair loop:
  - bounded self-repair can run when the first scaffold misses the readiness threshold
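The routing rule in the list above (reuse directly when a page is frameable, rebuild when X-Frame-Options or CSP blocks framing) can be sketched from response headers alone. This is a simplified illustration, not the engine's routing logic: the route names `direct-embed` and `bounded-rebuild` are hypothetical, and any `frame-ancestors` value other than a bare `*` is treated as blocking, which is stricter than the CSP specification.

```python
from typing import Mapping


def is_frame_blocked(headers: Mapping[str, str]) -> bool:
    """Return True when response headers forbid cross-origin framing."""
    normalized = {name.lower(): value for name, value in headers.items()}

    # X-Frame-Options: DENY and SAMEORIGIN both stop third-party embedding.
    xfo = normalized.get("x-frame-options", "").strip().lower()
    if xfo in {"deny", "sameorigin"}:
        return True

    # CSP: look for a frame-ancestors directive among the ';'-separated parts.
    csp = normalized.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.strip().split()
        if parts and parts[0].lower() == "frame-ancestors":
            # Simplification: only a bare wildcard counts as frameable.
            return parts[1:] != ["*"]
    return False


def choose_route(headers: Mapping[str, str]) -> str:
    """Pick a hypothetical route name from the framing headers."""
    return "bounded-rebuild" if is_frame_blocked(headers) else "direct-embed"


print(choose_route({"X-Frame-Options": "DENY"}))
print(choose_route({"Content-Security-Policy": "default-src 'self'; frame-ancestors 'none'"}))
print(choose_route({"Content-Type": "text/html"}))
```

A real router would also weigh whether reuse is safe and whether preview, export, or source routes exist; header inspection only answers the frameability question.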
Requirements:
- Node.js 18 or newer
- Python 3.9 or newer
- Chrome or Chromium available locally for Playwright runtime capture
The package uses playwright-core; it does not download a browser by itself.
Installing this project adds the source-first-clone plugin bundle, the exact-clone-intake skill, and the MCP server that exposes the URL inspection, capture, rebuild, and verification tools.
npm install -g web-embedding
web-embedding install
web-embedding doctor
Clone a public URL after installing:
web-embedding clone \
--url https://developer.mozilla.org/en-US/ \
--output-dir ./.tmp/mdn-clone \
--wait-seconds 2 \
--timeout-seconds 35 \
--breakpoints mobile tablet
If you already have an older local plugin installed, overwrite it with:
web-embedding install --force
web-embedding doctor
You can also run the installer without a global install:
npx web-embedding install
For MCP clients that can launch npm stdio servers:
{
"mcpServers": {
"source-first-clone": {
"command": "npx",
"args": ["-y", "web-embedding@latest", "mcp"]
}
}
}
For local smoke testing:
npx web-embedding@latest mcp
The MCP Registry identity is io.github.jongko54/web-embedding; server.json and package.json#mcpName are kept in sync for registry ownership verification.
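A check that the two manifests agree could look like the sketch below. It assumes that server.json stores the registry identity under a top-level `name` field, which is not confirmed by this README, and `registry_names_match` is a hypothetical helper rather than part of the package.

```python
import json
import tempfile
from pathlib import Path


def registry_names_match(package_json: Path, server_json: Path) -> bool:
    """Compare package.json#mcpName with the (assumed) top-level name in server.json."""
    package = json.loads(package_json.read_text(encoding="utf-8"))
    server = json.loads(server_json.read_text(encoding="utf-8"))
    mcp_name = package.get("mcpName")
    return bool(mcp_name) and mcp_name == server.get("name")


# Demo against throwaway files; a real check would point at the repo's manifests.
with tempfile.TemporaryDirectory() as tmp:
    identity = "io.github.jongko54/web-embedding"
    package_path = Path(tmp, "package.json")
    server_path = Path(tmp, "server.json")
    package_path.write_text(json.dumps({"mcpName": identity}), encoding="utf-8")
    server_path.write_text(json.dumps({"name": identity}), encoding="utf-8")
    in_sync = registry_names_match(package_path, server_path)
    print(in_sync)
```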
The public remote MCP intake endpoint for Apps SDK Developer Mode is:
https://webembedding-mcp.vercel.app/mcp
It exposes low-risk source-first routing tools such as URL inspection, embed candidate discovery, clone-mode classification, and embed snippet generation. Full browser capture, HAR replay, queues, bounded rebuilds, and one-pass clone execution remain local-first through the stdio MCP package.
Apps SDK review pages are hosted alongside the endpoint:
https://webembedding-mcp.vercel.app/privacy.html,
https://webembedding-mcp.vercel.app/terms.html, and
https://webembedding-mcp.vercel.app/submission.html.
webEmbedding has two different execution boundaries:
- Hosted Apps SDK intake: read-only URL routing and classification only. It accepts absolute http and https URLs, does not run Playwright, does not read local files, does not use browser profiles or storage state, and does not persist capture artifacts.
- Local stdio MCP and CLI: full capture, HAR replay, queues, rebuild scaffolds, and self-verify run on the user's machine under the user's local agent and filesystem permissions. Output is written only to caller-provided paths such as output_dir or queue_root.
- Authenticated capture: session-aware runs require the caller to intentionally provide a storage_state_path or user_data_dir. webEmbedding does not collect credentials, perform login bypasses, or treat a public login shell as private app evidence.
- Access-controlled surfaces: paywalls, captcha flows, private dashboards, payment/checkout/account/admin flows, and native-app-led screens should be blocked, marked needs_session, or sent to manual review unless the user has explicit authorization and supplies the needed evidence.
Local URL entrypoints reject non-HTTP schemes such as file:// so an agent cannot use clone/capture tools as a local file reader. Telemetry is disabled by default and, when enabled, excludes target URLs, local paths, captured HTML, screenshots, storage state, environment variables, API keys, and command output.
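The scheme restriction can be illustrated with a small validator built on the standard library. This is a sketch of the general technique, not the project's entrypoint code; `require_http_url` and its error message are hypothetical.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def require_http_url(url: str) -> str:
    """Return the URL unchanged if it is an absolute http(s) URL, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.netloc:
        raise ValueError(f"unsupported URL for capture: {url!r}")
    return url


for candidate in (
    "https://developer.mozilla.org/en-US/",
    "file:///etc/passwd",
    "/tmp/local.html",
):
    try:
        require_http_url(candidate)
        print("accepted:", candidate)
    except ValueError as error:
        print("rejected:", error)
```

Requiring a non-empty host as well as an allowed scheme also rejects relative paths and malformed inputs such as `http:///x`.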
This repository includes marketplace metadata for the two local agent surfaces:
- Codex: .agents/plugins/marketplace.json points to ./bundle/source-first-clone.
- Claude Code: .claude-plugin/marketplace.json points to the same bundle, and the bundle includes .claude-plugin/plugin.json.
Claude Code users can add the marketplace from GitHub with:
/plugin marketplace add jongko54/webEmbedding
/plugin install source-first-clone@webembedding
AI auto-selection expectations and golden prompts live in docs/ai-distribution.md and evals/ai-selection/webembedding-golden-prompts.json.
curl -fsSL https://github.com/jongko54/webEmbedding/releases/latest/download/install.sh | bash
git clone https://github.com/jongko54/webEmbedding.git
cd webEmbedding
npm install
node ./bin/web-embedding.mjs install
node ./bin/web-embedding.mjs doctor
Useful for testing without touching your real agent home:
python3 python/web_embedding/installer.py install --target-home ./.tmp/home
python3 python/web_embedding/installer.py doctor --target-home ./.tmp/home
python3 python/web_embedding/installer.py uninstall --target-home ./.tmp/home
Telemetry is disabled by default. On an interactive first install, web-embedding install asks once and defaults to No. Non-interactive installs such as CI and curl | bash do not prompt. If you opt in, web-embedding sends a small anonymous command-completion event to a JSON POST endpoint you control. It does not send target URLs, local paths, captured HTML, screenshots, storage state, environment variables, API keys, or command output.
Enable it during install:
web-embedding install --telemetry --telemetry-endpoint https://your-collector.example/events
Or manage it later:
web-embedding telemetry enable --endpoint https://your-collector.example/events
web-embedding telemetry status
web-embedding telemetry disable
web-embedding telemetry reset-id
Each event contains an anonymous install id, package version, command name, success/failure status, OS/runtime basics, and coarse option flags such as breakpoint_count or install_source.
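One way to keep an event limited to those fields is to build it from an allowlist, so that URLs and paths passed as command options never reach the payload. The sketch below is illustrative only: the JSON key names, the `0.0.0` version placeholder, and the `build_event` helper are assumptions, not the package's actual event schema.

```python
import platform
import uuid
from typing import Any, Dict

# Coarse option flags named in the text; everything else is dropped.
ALLOWED_FLAGS = {"breakpoint_count", "install_source"}


def build_event(
    install_id: str, version: str, command: str, ok: bool, options: Dict[str, Any]
) -> Dict[str, Any]:
    """Build an event from allowlisted fields only (hypothetical key names)."""
    return {
        "install_id": install_id,
        "version": version,
        "command": command,
        "status": "success" if ok else "failure",
        "os": platform.system(),
        "python": platform.python_version(),
        "flags": {key: value for key, value in options.items() if key in ALLOWED_FLAGS},
    }


event = build_event(
    install_id=uuid.uuid4().hex,  # random id with no link to the user or machine
    version="0.0.0",  # placeholder, not a real release number
    command="clone",
    ok=True,
    options={
        "breakpoint_count": 2,
        "url": "https://example.com/",  # dropped by the allowlist
        "output_dir": "./.tmp/out",  # dropped by the allowlist
    },
)
print(event["flags"])
```

An allowlist fails closed: a newly added command option stays out of telemetry until someone explicitly marks it as safe to report.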
Environment controls:
WEB_EMBEDDING_TELEMETRY=1
WEB_EMBEDDING_NO_TELEMETRY=1
WEB_EMBEDDING_TELEMETRY_PROMPT=0
WEB_EMBEDDING_TELEMETRY_ENDPOINT=https://your-collector.example/events
WEB_EMBEDDING_TELEMETRY_LOG=./telemetry.jsonl
Run a local/self-hosted JSONL collector:
npm run telemetry:collector -- --host 127.0.0.1 --port 8765 --out ./telemetry.jsonl
WEB_EMBEDDING_TELEMETRY=1 \
WEB_EMBEDDING_TELEMETRY_ENDPOINT=http://127.0.0.1:8765/events \
web-embedding doctor
Summarize collected usage:
npm run telemetry:summarize -- ./telemetry.jsonl
The summary includes install and clone executions, total command executions, unique anonymous install IDs, command counts, and version counts. See docs/telemetry.md for collector and analyzer details.
Inspect a URL and get route hints:
node ./bin/web-embedding.mjs inspect \
--url https://developer.mozilla.org/en-US/
Run a safe preflight audit before capture or clone:
node ./bin/web-embedding.mjs audit \
--url https://developer.mozilla.org/en-US/
The audit reports whether the reference is ready for exact/embed reuse, needs local capture, needs an authenticated session, requires manual review, or should be blocked before any browser capture or filesystem output runs.
Run the full clone workflow:
node ./bin/web-embedding.mjs clone \
--url https://developer.mozilla.org/en-US/ \
--output-dir ./.tmp/mdn-clone \
--wait-seconds 2 \
--timeout-seconds 35 \
--breakpoints mobile tablet
Run a lightweight quality benchmark:
python3 scripts/check_clone_quality_bench.py \
https://developer.mozilla.org/en-US/ \
--output-root ./.tmp/clone-quality-bench \
--wait-seconds 1 \
--timeout-seconds 35 \
--breakpoints mobile tablet
The benchmark prints compact rows for root, visual, and breakpoint scores. The full artifacts are written under the output directory.
node ./bin/web-embedding.mjs capabilities
node ./bin/web-embedding.mjs install
node ./bin/web-embedding.mjs doctor
node ./bin/web-embedding.mjs uninstall
node ./bin/web-embedding.mjs paths
node ./bin/web-embedding.mjs telemetry status
node ./bin/web-embedding.mjs inspect --url https://www.mozilla.org/
node ./bin/web-embedding.mjs audit --url https://www.mozilla.org/
node ./bin/web-embedding.mjs capture \
--url https://www.mozilla.org/ \
--output-dir ./.tmp/capture-mozilla \
--breakpoints mobile tablet
node ./bin/web-embedding.mjs reproduce \
--url https://www.mozilla.org/ \
--output-dir ./.tmp/reproduce-mozilla \
--breakpoints mobile tablet
node ./bin/web-embedding.mjs clone \
--url https://www.mozilla.org/ \
--output-dir ./.tmp/clone-mozilla \
--breakpoints mobile tablet
node ./bin/web-embedding.mjs verify \
--reference-bundle ./.tmp/reference/capture.json \
--candidate-bundle ./.tmp/candidate/capture.json
A clone run can produce:
- capture.json
- pipeline-run-manifest.json
- dom/snapshot.json
- dom/runtime.html
- styles/computed-summary.json
- styles/css-analysis.json
- network/manifest.json
- network/har.json
- network/har-like.json
- network/replay-report.json
- assets/inventory.json
- interactions/states.json
- interactions/trace.json
- screenshots/runtime.png
- session/storage-state.json
- reproduction/plan.json
- reproduction/evidence-limitations.json
- reproduction/rebuild-prompt.txt
- reproduction/rebuild/starter.html
- reproduction/rebuild/starter.css
- reproduction/rebuild/starter.tsx
- reproduction/rebuild/next-app/
- reproduction/self-verify/summary.json
- reproduction/self-verify/renderers/*/verification.json
- reproduction/self-verify/renderers/*/visual-qa.json
- reproduction/self-verify/renderers/*/breakpoints/*-verification.json
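Because a run can produce these files but is not guaranteed to produce all of them, a quick presence check on the output directory can be useful before reading any artifact. The sketch below uses only the standard library; `artifact_presence` and the choice of "core" paths are illustrative assumptions, not part of the CLI.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable

# A small subset of the artifact paths listed above, picked for illustration.
CORE_ARTIFACTS = (
    "capture.json",
    "dom/snapshot.json",
    "network/har.json",
    "reproduction/self-verify/summary.json",
)


def artifact_presence(
    output_dir: Path, expected: Iterable[str] = CORE_ARTIFACTS
) -> Dict[str, bool]:
    """Map each expected relative path to whether it exists as a file."""
    return {relative: (output_dir / relative).is_file() for relative in expected}


# Demo against a throwaway directory that holds only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dom").mkdir()
    (root / "capture.json").write_text("{}", encoding="utf-8")
    (root / "dom" / "snapshot.json").write_text("{}", encoding="utf-8")
    report = artifact_presence(root)
    missing = [path for path, present in report.items() if not present]
    print(missing)
```

Pointing `artifact_presence` at the `--output-dir` of a clone run reports which evidence was captured, which matters when a later step depends on a specific file.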
Run the default small benchmark:
npm run check:clone-bench:local
Run the universal route regression corpus and expectations gate:
npm run check:benchmark-routes:local
Run a lightweight clone score gate:
npm run check:clone-score-gate:local
Validate the committed benchmark evidence manifest:
npm run check:benchmark-evidence:local
Validate production pipeline gates:
npm run check:production-readiness:local
Run the operational smokes individually:
npm run check:job-queue:local
npm run check:har-replay:local
npm run check:authenticated-corpus:local
Classify failure/action codes from a route report:
npm run classify:pipeline-failures -- --report ./.tmp/universal-route-benchmark/universal-route-report.json
Find low-scoring persisted benchmark artifacts:
npm run summarize:benchmark-scores -- --root ./.tmp --min-score 60 --max-score 70
Run specific URLs:
python3 scripts/check_clone_quality_bench.py \
https://www.example.com \
https://www.mozilla.org/ \
--no-breakpoints
Run a responsive benchmark:
python3 scripts/check_clone_quality_bench.py \
https://developer.mozilla.org/en-US/ \
--breakpoints mobile tablet
python3 -m py_compile \
bundle/source-first-clone/mcp/source_first_clone/*.py \
scripts/check_integration_smoke.py \
scripts/check_clone_quality_bench.py
npm run check:integration:local
git diff --check
- bundle/source-first-clone: Installed plugin bundle, MCP server, and exact-clone intake skill.
- bundle/source-first-clone/mcp/source_first_clone: Capture, planning, rebuild, repair, and verification engine.
- bin/web-embedding.mjs: Node CLI wrapper.
- python/web_embedding/installer.py: Shared installer and command dispatcher.
- scripts/check_clone_quality_bench.py: URL clone quality benchmark helper.
- scripts/benchmark_routes.py: Universal route/capture-depth regression benchmark helper.
- scripts/check_benchmark_report.py: Benchmark expectation validator for exact, minimum, and contains-style checks.
- scripts/check_benchmark_evidence.py: Benchmark evidence manifest validator.
- scripts/check_job_queue_smoke.py: Filesystem async clone job queue smoke test.
- scripts/check_har_replay_smoke.py: Deterministic HAR replay engine smoke test.
- scripts/benchmark_authenticated_corpus.py: User-provided authenticated dashboard corpus runner.
- scripts/summarize_benchmark_scores.py: Utility for finding low or high scoring persisted benchmark artifacts under an output root.
- scripts/classify_pipeline_failures.py: Operational failure/action taxonomy summarizer for reports and capture artifacts.
- scripts/check_production_readiness.py: Production readiness gate validator for corpus, failure taxonomy, CI wiring, and policy docs.
- scripts/check_integration_smoke.py: Release, install, and URL-only clone smoke test.
- scripts/release_bundle.py: Release artifact builder.
- docs/: Architecture notes and universal benchmark documentation.
The strongest claim for this project is:
A source-first website cloning engine that combines Playwright capture, HAR replay, MCP tools, and self-verification to rebuild iframe-blocked public pages with reproducible visual, DOM, style, interaction, and responsive scores.
Avoid treating the output as a legal or ownership bypass. The engine can reconstruct public page structure, but permission, licensing, and acceptable use still matter.
MIT
