Crawling our website takes a very long time, looking for configuration advice #570
Hi!
Replies: 1 comment 1 reply
The queues are configured to limit the load on each target host/domain. As some queues are completed, the overall crawl rate drops because there are fewer hosts remaining to be crawled. Heritrix is, by default, configured to be reasonably polite in its crawling, to avoid overwhelming individual sites (or getting banned by angry sysadmins). Of course, this means that very large sites take a very long time, as you are only crawling one URL every few seconds. The relevant settings are in this section:
Between successive fetches to the same "queue" (host), Heritrix waits the amount of time it took to fetch the previous URL from that queue, multiplied by the delayFactor and bounded by minDelayMs and maxDelayMs (0.5 and 3 seconds respectively here). It is also configured above to respect a robots.txt Crawl-delay directive of up to 3 seconds. I'd generally consider this a reasonable configuration unless you have explicit permission to harvest a site more aggressively.
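For reference, these politeness settings live on the disposition processor bean in crawler-beans.cxml. A minimal sketch with values matching the limits described above (the specific numbers here are illustrative assumptions, not copied from the original question's configuration) would look roughly like this:

```xml
<!-- DISPOSITION: schedules the politeness delay before the next fetch
     from the same queue (host). Values are illustrative, chosen to match
     the limits described above, not the poster's actual configuration. -->
<bean id="disposition"
      class="org.archive.crawler.postprocessor.DispositionProcessor">
  <!-- wait delayFactor x (time taken by the previous fetch from this queue) -->
  <property name="delayFactor" value="5.0" />
  <!-- never wait less than 0.5 s between fetches from the same queue -->
  <property name="minDelayMs" value="500" />
  <!-- never wait more than 3 s, regardless of delayFactor -->
  <property name="maxDelayMs" value="3000" />
  <!-- honour a robots.txt Crawl-delay directive, but only up to 3 s -->
  <property name="respectCrawlDelayUpToSeconds" value="3" />
</bean>
```

Raising delayFactor or the delay bounds slows the crawl further; lowering them speeds it up at the cost of politeness, which is only advisable for sites you have permission to hit harder.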