Processing Chains
A Heritrix crawl job has three main pipelines, known as Processor Chains (each a sequential application of swappable Processor modules -- see Processor Settings), with the Frontier acting as a buffer between the first two:
- The Candidates Chain:
- This chain processes incoming Crawl URIs, deciding whether to keep them (according to the Scope), and priming them to be deposited in the Frontier.
- See Candidate Chain Processors
- The Frontier:
- Crawl URIs accepted into this crawl are stored here in priority order, in a set of distinct queues.
- Usually, there is one queue per 'authority' (e.g. example.com:80), and the queue management ensures the desired crawl delay is honoured for each queue.
- See Frontier
- The Fetch Chain:
- As Crawl URIs are emitted by the Frontier, the Fetch Chain processes each one and decides what to do with it, how to download it, etc.
- This chain also performs operations like link extraction.
- See Fetch Chain Processors
- The Disposition Chain:
- Once the Fetch Chain has finished, any required post-processing is handled here.
- For example, this is where the downloaded resources are written into WARC files.
- See Disposition Chain Processors
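These stages are wired together as top-level Spring beans in the job's crawler-beans.cxml. The following is a minimal sketch using the bean ids and classes of the stock Heritrix 3 profile, with the per-chain processor lists elided:

<!-- CANDIDATES CHAIN: scopes and primes discovered URIs for the Frontier -->
<bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
  <property name="processors">
    <list> <!-- e.g. candidateScoper, preparer (elided) --> </list>
  </property>
</bean>

<!-- FRONTIER: accepted URIs wait here in per-authority queues, in priority order -->
<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
</bean>

<!-- FETCH CHAIN: downloads each URI emitted by the Frontier, extracts links -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list> <!-- e.g. fetchHttp, extractorHtml (elided) --> </list>
  </property>
</bean>

<!-- DISPOSITION CHAIN: post-processing, e.g. writing WARC records -->
<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list> <!-- e.g. warcWriter, disposition (elided) --> </list>
  </property>
</bean>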
Each URI taken off the Frontier queue runs through the processing chains. URIs are always processed in the order shown in the diagram below, unless a particular processor throws a fatal error or decides to stop the processing of the current URI.
Each processing chain is made up of zero or more individual processors. For example, the Fetch Chain might comprise the extractorCss and extractorJs processors. Within a processing chain, the order in which the processors are run is the order in which they are listed in the crawler-beans.cxml file.
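The run order of the two extractors above therefore follows from their position in the list. Below is a trimmed excerpt in the style of the default profile (the referenced beans, such as preselector and fetchHttp, must be defined elsewhere in the file):

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <ref bean="preselector"/>   <!-- re-check scope -->
      <ref bean="preconditions"/> <!-- DNS and robots.txt prerequisites -->
      <ref bean="fetchDns"/>
      <ref bean="fetchHttp"/>
      <ref bean="extractorHttp"/>
      <ref bean="extractorHtml"/>
      <ref bean="extractorCss"/>  <!-- runs before extractorJs because it is listed first -->
      <ref bean="extractorJs"/>
    </list>
  </property>
</bean>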
Diagram: HeritrixProcessorChains.png