Heritrix BdbFrontier
The BdbFrontier visits discovered URIs and sites in a generally breadth-first manner. It offers configuration options for controlling how it throttles activity against particular hosts, and other options that bias the crawl towards finishing hosts already in progress ("site-first" crawling) or towards cycling among all hosts with pending URIs.
Discovered URIs are crawled only once, with the exception of robots.txt and DNS information, which can be configured to be refreshed at specific intervals for each host.
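These refresh intervals are configured on the precondition-checking step rather than on the frontier itself. A minimal sketch, assuming the PreconditionEnforcer bean's validity-duration properties (normally set as `<property>` elements in crawler-beans.cxml):

```java
import org.archive.crawler.prefetch.PreconditionEnforcer;

// Sketch: refresh robots.txt and DNS information once per day per host.
// In a real job these values are set declaratively in crawler-beans.cxml.
PreconditionEnforcer preparer = new PreconditionEnforcer();
preparer.setRobotsValidityDurationSeconds(86400); // re-fetch robots.txt after 24h
preparer.setIpValidityDurationSeconds(86400);     // re-resolve DNS after 24h
```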
As of Heritrix 3.1, there are two new properties:
| Property Name | Default | Description |
| --- | --- | --- |
| largestQueuesCount | 20 | Controls how many of the largest queues are tracked and reported in the "frontier report". |
| maxQueuesPerReportCategory | 2000 | Controls the maximum number of queues per category listed in the "frontier report". |
Note that the largest-queues information may be approximate: the list is updated only when a queue grows into the largest group, so it can be stale when queues shrink out of the top-N or when the value is changed mid-crawl.
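Both are ordinary bean properties on the frontier, so they can be overridden in a job's configuration. A minimal sketch, assuming the standard setter names generated for these bean properties (in crawler-beans.cxml these would be `<property name="largestQueuesCount" value="50"/>` and so on):

```java
import org.archive.crawler.frontier.BdbFrontier;

// Sketch: widen the frontier report beyond the defaults.
BdbFrontier frontier = new BdbFrontier();
frontier.setLargestQueuesCount(50);           // track the 50 biggest queues (default 20)
frontier.setMaxQueuesPerReportCategory(5000); // list up to 5000 queues per category (default 2000)
```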
Queue storage is managed via an embedded BDB JE database. Because this is a simple key-value store, the multitude of queues are implemented as key prefixes. Each CrawlURI is stored in the BDB database as a binary blob, serialised using Kryo, under a key that combines the queue prefix (the classKey) with the crawl priority of the CrawlURI.
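As an illustration of this key scheme (a simplified sketch, not Heritrix's exact binary encoding), a key can be built so that entries group by queue and then sort by priority; the ordinal tiebreaker here is an assumption for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Simplified sketch of a prefix-plus-priority key layout; not the
// actual Heritrix encoding.
class FrontierKeySketch {
    static byte[] frontierKey(String classKey, int priority, long ordinal) {
        byte[] prefix = classKey.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(prefix.length + 1 + 4 + 8);
        buf.put(prefix);      // queue identity, e.g. "com,example,"
        buf.put((byte) 0);    // separator between prefix and ordering fields
        buf.putInt(priority); // lower values sort, and are fetched, first
        buf.putLong(ordinal); // insertion order within the same priority
        return buf.array();
    }
}
```

Because BDB JE orders keys lexicographically by byte, a range scan over one classKey prefix walks that queue's entries in priority order.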
The list of active/snoozed/etc. queues is held in memory and written to disk in JSON format during checkpointing. When resuming from a checkpoint, the BdbFrontier database is reused, but the necessary queue state comes from the JSON files. If the frontier database is reused without resuming from a checkpoint, it is truncated and all data in the BdbFrontier is discarded.
Updating the contents of the frontier from multiple threads requires care, as changes have to be made and committed to disk without conflicts. Once changes have been made to e.g. a WorkQueue, the wq.makeDirty() call initiates a process whereby the WorkQueue is serialised out to disk and read back in again (ensuring consistency, but dropping any transient fields). This means updates to each WorkQueue must be synchronised across threads so that no two updates happen at the same time, e.g.:
```java
synchronized (wq) {
    // ...do updates...
    wq.makeDirty();
}
```
This is all done using an ObjectIdentityBdbManualCache, which makes it possible to interact with the database as a simple collection while holding in memory only those items that are needed.
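The resulting access pattern looks roughly like the following sketch; the method names here are simplified assumptions, not the exact ObjectIdentityBdbManualCache API:

```java
// Sketch of the access pattern; names are illustrative assumptions.
// The cache behaves like a Map<String, WorkQueue> backed by BDB JE:
// only queues currently in use are held in memory, and makeDirty()
// schedules the mutated queue to be written back to the database.
WorkQueue wq = allQueues.getOrUse(classKey, () -> new BdbWorkQueue(classKey));
synchronized (wq) {
    wq.enqueue(frontier, curi); // mutate the in-memory WorkQueue...
    wq.makeDirty();             // ...then trigger re-serialisation to disk
}
```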