Frontier Unbundling Design Details
Current things that happen to a discovered URI:
- scope rules applied – some URIs rejected (LinksScoper)
- URI passed to frontier for scheduling (FrontierScheduler)
- URI canonicalized (inside frontier sync; bottleneck)
- URI precedence value assigned (inside frontier sync; bottleneck)
- used to choose queue key (inside frontier)
There may be other bits of processing/classification that can occur to discovered, not-yet-fetched URIs, so setting all these steps up in a new configurable chain makes sense: it adds flexibility, offloads complexity from the frontier, and reduces serialized bottlenecks.
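To make that last point concrete, here is a schematic Java sketch (every name in it is illustrative, not Heritrix's actual frontier API) contrasting classification work done inside the frontier's lock with the unbundled approach, where the URI is prepared beforehand:

```java
// Schematic only: every name here is illustrative, not actual Heritrix API.
class FrontierBottleneckSketch {
    static class Uri {
        String raw, canonical, queueKey;
        int precedence;
        Uri(String raw) { this.raw = raw; }
    }

    private final Object frontierLock = new Object();

    /** Today: canonicalization, precedence, and queue-key choice all run
     *  inside the frontier's lock, serializing every discovered URI. */
    void scheduleSerialized(Uri u) {
        synchronized (frontierLock) {
            u.canonical = u.raw.toLowerCase();   // stand-in for canonicalization
            u.precedence = u.raw.length() % 10;  // stand-in for a precedence policy
            u.queueKey = u.canonical;            // stand-in for queue-key selection
            enqueue(u);
        }
    }

    /** Unbundled: a candidate chain has already prepared the URI, in
     *  parallel, so the lock covers only the queue insertion itself. */
    void scheduleUnbundled(Uri preparedUri) {
        synchronized (frontierLock) {
            enqueue(preparedUri);
        }
    }

    private void enqueue(Uri u) { /* insert into the per-queueKey work queue */ }
}
```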
Steps that can be moved to late-in-processing-chain modules:
- the disposition decision (success, retry, failure) moves to a disposition-decision module late in the processor chain
- the politeness delay moves to a politeness-policy module in the processing chain
We can refactor crawl-configuration to feature three chains of Processors.
Two of them are together analogous to the existing ProcessorChain, and apply to URIs that come off the frontier. The first chain is the 'fetch chain', and includes all steps that may be freely aborted/repeated in the case of checkpoints. The second chain is the 'disposition chain', and any URI which begins this chain should finish it, with all the attendant mutation of frontier queues and total stats, before a checkpoint is attempted. In the disposition chain, the former CrawlStateUpdater has been renamed DispositionProcessor; in addition to its previous robots/stats updating, it now updates the CrawlURI with info (like calculated politeness delays) for the frontier to consider.
The third chain takes the place of steps which now happen in LinksScoper and FrontierScheduler and the frontier, for URIs that have not yet been scheduled to a queue. This chain can be called 'candidate chain'. In addition to scoping, it also prepares the CrawlURI with precalculated info for all frontier decisions -- allowing these policies to be applied in parallel, outside the critical frontier locks/managerThread.
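For orientation, a minimal sketch of how the three chains might be wired up in the Spring-based crawl configuration; the bean ids, class names, and processor lists here are abbreviated and illustrative:

```xml
<!-- Sketch only: bean ids, class names, and processor lists are illustrative. -->
<bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
  <property name="processors">
    <list>
      <ref bean="candidateScoper"/>  <!-- per-URI scope check -->
      <ref bean="preparer"/>         <!-- FrontierPreparer: precomputes frontier decisions -->
    </list>
  </property>
</bean>

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- preselector, precondition enforcer, fetchers, extractors, writers;
           see the fuller fetch-chain sketch below -->
    </list>
  </property>
</bean>

<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list>
      <ref bean="candidates"/>   <!-- CandidatesProcessor: runs candidate chain on outlinks -->
      <ref bean="disposition"/>  <!-- DispositionProcessor: stats, robots, frontier hints -->
    </list>
  </property>
</bean>
```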
The CandidateChain will usually contain only two processors (though of course more can be added); a sketch of the preparation step follows this list:
- CandidateScoper: applies scoping to the one URI being processed, setting its fetchStatus negative if it should not be scheduled
- FrontierPreparer:
- calculates 'schedulingDirective' coarse prioritization
- calculates canonicalized URI
- calculates destination-queue key
- calculates 'cost'
- calculates URI precedence, if any
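A hedged Java sketch of what FrontierPreparer's per-URI preparation amounts to; the CrawlURI setters and policy collaborators shown are illustrative assumptions, not confirmed API:

```java
// Sketch of FrontierPreparer's per-URI work; the CrawlURI setters and
// policy collaborators shown are illustrative assumptions, not confirmed API.
public void prepare(CrawlURI curi) {
    // coarse 'schedulingDirective' prioritization (e.g. elevated for prerequisites)
    curi.setSchedulingDirective(computeSchedulingDirective(curi));
    // canonicalized URI, for already-seen tests and queue assignment
    curi.setCanonicalString(canonicalizationPolicy.canonicalize(curi));
    // destination-queue key, typically derived from the canonical authority
    curi.setClassKey(queueAssignmentPolicy.getClassKey(curi));
    // per-URI 'cost' charged against its queue's budget
    curi.setHolderCost(costAssignmentPolicy.costOf(curi));
    // URI precedence, if any precedence policy is configured
    uriPrecedencePolicy.uriScheduled(curi);
}
```

Because all of this is attached to the CrawlURI before it reaches the frontier, the frontier's synchronized sections only consult precomputed values.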
The candidate chain is applied to URIs discovered during the crawl.
The FrontierPreparer, however, is also available directly to the Frontier for help prepping any URIs that do not (yet) go through the CandidateChain, such as seeds or other URIs added by other means mid-crawl.
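For instance, a frontier might prep injected seeds along these lines (a hypothetical sketch; importSeed and the prepared-check are illustrative, not confirmed API):

```java
// Hypothetical sketch: a frontier reusing the shared FrontierPreparer for a
// seed injected outside the CandidateChain. Names are illustrative.
public void importSeed(CrawlURI seed) {
    if (seed.getClassKey() == null) {   // not yet prepared by a candidate chain
        preparer.prepare(seed);         // same precomputation discovered URIs get
    }
    schedule(seed);                     // now safe to enqueue under the frontier lock
}
```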
The CandidateChain is applied to each candidate URI in turn by a new processor, CandidatesProcessor, which replaces the LinksScoper and FrontierScheduler.
The FetchChain is the same as the first part of the general processing chain from H1/H2. As sketched after this list, it includes processors which:
- recheck scoping, if desired
- check/enforce preconditions
- try a fetch
- do link-extraction
- write ARC/WARC
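Mapped onto beans, a typical fetch chain might look like this; the ids follow convention but the list is abbreviated and illustrative:

```xml
<!-- Sketch: a typical fetch chain; abbreviated and illustrative. -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <ref bean="preselector"/>    <!-- recheck scoping -->
      <ref bean="preconditions"/>  <!-- check/enforce preconditions -->
      <ref bean="fetchDns"/>       <!-- resolve the queue's authority -->
      <ref bean="fetchHttp"/>      <!-- try a fetch -->
      <ref bean="extractorHtml"/>  <!-- link extraction -->
      <ref bean="warcWriter"/>     <!-- write ARC/WARC -->
    </list>
  </property>
</bean>
```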
The DispositionChain contains all steps which should be done atomically with regard to checkpointing. It will typically include just two processors:
- CandidatesProcessor: replaces the old LinksScoper and FrontierScheduler, running the candidate chain on each discovered URI, then scheduling (or applying special discovered-seed handling) for any URIs not cancelled by that chain
- DispositionProcessor: the renamed CrawlStateUpdater, updating stats, robots, and prefilling the CrawlURI with decisions/delays for the frontier to consult
(Some crawls may wish to move ARC/WARC writing to the disposition chain, as sketched below, if the risk of a small number of duplicate written records after checkpoint-resumptions is a concern.)
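A sketch of that variation, with the writer bean moved out of the fetch chain so a record is written atomically with disposition (bean ids illustrative):

```xml
<!-- Sketch: WARC writing moved into the disposition chain. -->
<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list>
      <ref bean="warcWriter"/>   <!-- moved here from the fetch chain -->
      <ref bean="candidates"/>
      <ref bean="disposition"/>
    </list>
  </property>
</bean>
```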