Finalize per-domain stats and completeness for staged domain crawls #1027
aponb wants to merge 6 commits into
Apply per-domain object and byte budgets to attributed seed domains, including embedded third-party resources and initial seed redirects. Also add coverage for attributed domain stats and limitReached behavior.
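As a rough illustration of the budgeting described above, the following minimal sketch charges every fetched object to its attributed seed domain and sets a limit flag once either budget is used up. All names here (`DomainStats`, `DomainBudget`, `record`, `limit_reached`) are hypothetical and are not taken from the PR; the actual implementation may count or compare differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DomainStats:
    """Hypothetical per-seed-domain counters (illustrative names only)."""
    objects: int = 0
    byte_count: int = 0
    limit_reached: bool = False


class DomainBudget:
    """Tracks object and byte budgets per attributed seed domain."""

    def __init__(self, max_objects: int, max_bytes: int) -> None:
        self.max_objects = max_objects
        self.max_bytes = max_bytes
        self.stats: Dict[str, DomainStats] = {}

    def record(self, seed_domain: str, size: int) -> bool:
        """Charge one fetched object of `size` bytes to `seed_domain`.

        Returns True while the domain is still within budget and
        False once either the object or the byte budget is reached.
        """
        s = self.stats.setdefault(seed_domain, DomainStats())
        s.objects += 1
        s.byte_count += size
        # Assumption: reaching the budget exactly counts as "limit reached".
        if s.objects >= self.max_objects or s.byte_count >= self.max_bytes:
            s.limit_reached = True
        return not s.limit_reached


budget = DomainBudget(max_objects=3, max_bytes=10_000)
budget.record("example.org", 2_000)  # page on the seed host itself
budget.record("example.org", 1_500)  # embedded third-party resource, attributed to the seed
budget.record("example.org", 500)    # target of an initial seed redirect
```

The point of the sketch is that the key is the attributed seed domain, not the host that served the bytes, so third-party embeds and seed redirects draw from the same budget.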
This PR combines work that was originally prepared as a 3-step series, but is submitted as a
single PR to keep review and follow-up changes in one place.
The overall goal is to support staged domain crawling, where later crawl stages can decide per
domain whether more work is needed or whether that domain is already finished.
This PR introduces three pieces that build on each other:
seed domain rather than only the literal host
In practice, this means:
The completeness semantics are intentionally split by crawl mode:
is a probe signal:
Implementation-wise, the PR also preserves the narrower depth-0 behavior while adding final
completeness handling for deep crawls, including conservative treatment of unresolved
queued/pending/failed work and crash-related uncertainty.
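One way to read the "conservative treatment" for deep crawls is that a domain only counts as complete when nothing is left unresolved and no crash has cast doubt on the counters. The sketch below assumes exactly that; `DomainWork` and `is_complete` are hypothetical names, and the PR's real rules for failed work and crash handling may be more nuanced.

```python
from dataclasses import dataclass


@dataclass
class DomainWork:
    """Hypothetical snapshot of outstanding work for one seed domain."""
    queued: int = 0
    pending: int = 0
    failed: int = 0
    crashed: bool = False  # assumption: a crash makes the counters untrustworthy


def is_complete(work: DomainWork) -> bool:
    """Conservatively decide whether a deep crawl of this domain is finished."""
    if work.crashed:
        # After a crash we cannot be sure what was lost, so never claim completeness.
        return False
    # Any unresolved queued, pending, or failed item keeps the domain incomplete.
    return work.queued == 0 and work.pending == 0 and work.failed == 0
```

With this reading, `is_complete(DomainWork())` is true, while a single failed page or a crash flag is enough to keep the domain open for a later stage.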
In short, this PR turns domain stats from descriptive reporting into a domain-level decision
signal that later crawl stages can use operationally.