Finalize per-domain stats and completeness for staged domain crawls #1027

Open
aponb wants to merge 6 commits into webrecorder:main from aponb:domain-stats-completeness-finalize-deep-crawls

aponb (Contributor) commented Apr 30, 2026

This PR combines work that was originally prepared as a three-step series. It is submitted as a
single PR to keep review and follow-up changes in one place.

The overall goal is to support staged domain crawling, where later crawl stages can decide per
domain whether more work is needed or whether that domain is already finished.

This PR introduces three pieces that build on each other:

  • per-domain attribution for discovered URLs and captured resources, based on the originating
    seed domain rather than only the literal host
  • per-domain reporting and limit enforcement via reports/domainStats.json
  • domain completeness reporting for both shallow domain probes and deep domain crawls

In practice, this means:

  • redirect-derived seeds keep the domain attribution of the original seed
  • per-domain object and byte counts are tracked
  • domain-level limits are enforced against the attributed domain model
  • completeness state is persisted across save/restore and stop/resume
  • depth-0 domain crawls can report whether a domain appears to need later follow-up
  • deep domain crawls can report a final domain-level end state
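
The attribution and limit behavior listed above can be pictured with a minimal sketch. It is illustrative only: the class `DomainAttribution`, its methods, and the `maxObjects`/`maxBytes` parameters are hypothetical names, not identifiers from this PR, and the actual implementation may differ.

```typescript
// Hypothetical sketch; these names are not taken from the PR's code.
type DomainStats = { objects: number; bytes: number; limitReached: boolean };

class DomainAttribution {
  // Maps a host to the seed domain it is attributed to.
  private hostToSeedDomain = new Map<string, string>();
  private stats = new Map<string, DomainStats>();
  private maxObjects: number;
  private maxBytes: number;

  constructor(maxObjects: number, maxBytes: number) {
    this.maxObjects = maxObjects;
    this.maxBytes = maxBytes;
  }

  // Register a seed; its own host becomes the attributed domain.
  addSeed(seedUrl: string): string {
    const domain = new URL(seedUrl).hostname;
    this.hostToSeedDomain.set(domain, domain);
    this.statsFor(domain);
    return domain;
  }

  // A redirect target keeps the attribution of the original seed.
  addRedirect(seedDomain: string, redirectedUrl: string): void {
    this.hostToSeedDomain.set(new URL(redirectedUrl).hostname, seedDomain);
  }

  // Look up the attributed domain for a URL, if its host is known.
  domainFor(url: string): string | undefined {
    return this.hostToSeedDomain.get(new URL(url).hostname);
  }

  // Count one captured resource against the attributed domain.
  // Returns false when that domain's budget was already exhausted.
  record(seedDomain: string, bytes: number): boolean {
    const s = this.statsFor(seedDomain);
    if (s.limitReached) return false;
    s.objects += 1;
    s.bytes += bytes;
    s.limitReached = s.objects >= this.maxObjects || s.bytes >= this.maxBytes;
    return true;
  }

  // Get (or lazily create) the counters for a domain.
  statsFor(domain: string): DomainStats {
    let s = this.stats.get(domain);
    if (!s) {
      s = { objects: 0, bytes: 0, limitReached: false };
      this.stats.set(domain, s);
    }
    return s;
  }
}
```

In this sketch an embedded third-party resource would be passed to `record()` with the seed domain of the page that loaded it, so it counts toward that domain's budget rather than toward its literal host.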

The completeness semantics are intentionally split by crawl mode:

  • for crawls run with scopeType=domain, depth=0, --writeDomainStats, and
    --domainStatsCompleteness, completeness is a probe signal:
    • complete: no further theoretical in-scope next-hop candidates were found
    • incomplete: further theoretical in-scope next-hop candidates were found
    • unknown: the result is not reliable enough to conclude safely
  • for deep domain crawls, completeness is a final crawl outcome:
    • complete: the domain ran out of in-scope work cleanly
    • incomplete: the crawl stopped while domain work remained, or limits cut the domain off
    • unknown: external failures prevented a reliable conclusion
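
The two sets of semantics above could be expressed roughly as follows. This is a sketch under assumptions, not the PR's code: the input shapes (`ProbeResult`, `DeepResult`) and their field names are hypothetical, and checking for `unknown` before the other states is an assumed ordering.

```typescript
// Hypothetical sketch; field names and precedence are assumptions.
type Completeness = "complete" | "incomplete" | "unknown";

interface ProbeResult {
  reliable: boolean; // false if the probe result cannot be trusted
  nextHopCandidates: number; // theoretical in-scope next-hop URLs found
}

// Depth-0 probe: does the domain appear to need later follow-up?
function probeCompleteness(r: ProbeResult): Completeness {
  if (!r.reliable) return "unknown";
  return r.nextHopCandidates > 0 ? "incomplete" : "complete";
}

interface DeepResult {
  externalFailure: boolean; // e.g. a crash prevented a reliable conclusion
  workRemaining: boolean; // crawl stopped while domain work remained
  limitReached: boolean; // a limit cut the domain off
}

// Deep crawl: what is the final domain-level end state?
function deepCompleteness(r: DeepResult): Completeness {
  if (r.externalFailure) return "unknown";
  if (r.workRemaining || r.limitReached) return "incomplete";
  return "complete";
}
```

Both functions return the same three states, but the inputs differ: the probe looks only at what could be crawled next, while the deep variant looks at how the crawl actually ended.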

Implementation-wise, the PR preserves the narrower depth-0 behavior while adding final
completeness handling for deep crawls. Unresolved queued, pending, or failed work and
crash-related uncertainty are treated conservatively.

In short, this PR turns domain stats from descriptive reporting into a domain-level decision
signal that later crawl stages can use operationally.
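
As an illustration of how a later stage might consume this signal, the sketch below selects the domains that still need work. The report shape used here (a map from domain to an entry with a `completeness` field) is an assumption for illustration and is not confirmed to match the actual layout of reports/domainStats.json. Treating `unknown` as needing follow-up is also an assumed policy.

```typescript
// Hypothetical sketch; the report shape is assumed, not the PR's schema.
type Completeness = "complete" | "incomplete" | "unknown";

interface DomainEntry {
  completeness: Completeness;
}

// Return the domains a later crawl stage should revisit, sorted by name.
function domainsNeedingFollowUp(
  report: Record<string, DomainEntry>,
): string[] {
  return Object.entries(report)
    .filter(([, entry]) => entry.completeness !== "complete")
    .map(([domain]) => domain)
    .sort();
}

// Example input as it might look after JSON.parse() of a stats report.
const exampleReport: Record<string, DomainEntry> = {
  "a.example": { completeness: "complete" },
  "b.example": { completeness: "incomplete" },
  "c.example": { completeness: "unknown" },
};

const followUp = domainsNeedingFollowUp(exampleReport);
```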

aponb added 4 commits May 6, 2026 17:16
Apply per-domain object and byte budgets to attributed seed domains,
including embedded third-party resources and initial seed redirects.
Also add coverage for attributed domain stats and limitReached behavior.
aponb force-pushed the domain-stats-completeness-finalize-deep-crawls branch from 2446d3b to bcc20ca May 6, 2026 15:20

socket-security Bot commented May 7, 2026

Warning

Review the following alerts detected in dependencies.

According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.

Action: Warn
Severity: Critical
Alert: Critical CVE: Basic FTP has Path Traversal Vulnerability in its downloadToDir() method in npm basic-ftp

CVE: GHSA-5rq4-664w-9x2c Basic FTP has Path Traversal Vulnerability in its downloadToDir() method (CRITICAL)

Affected versions: < 5.2.0

Patched version: 5.2.0

From: ? npm/puppeteer-core@24.38.0 npm/puppeteer@24.4.0 npm/lighthouse@12.5.1 npm/basic-ftp@5.0.5


Next steps: Take a moment to review the security alert above. Review the linked package source code to understand the potential risk. Ensure the package is not malicious before proceeding. If you're unsure how to proceed, reach out to your security team or ask the Socket team for help at support@socket.dev.

Suggestion: Remove or replace dependencies that include known critical CVEs. Consumers can use dependency overrides or npm audit fix --force to remove vulnerable dependencies.

Mark the package as acceptable risk. To ignore this alert only in this pull request, reply with the comment @SocketSecurity ignore npm/basic-ftp@5.0.5. You can also ignore all packages with @SocketSecurity ignore-all. To ignore an alert for all future pull requests, use Socket's Dashboard to change the triage state of this alert.
