Crawling our website takes a very long time, looking for configuration advice #570
Hi!
Replies: 1 comment 1 reply
The queues are configured to limit the load on each target host/domain. As some queues are completed, the overall crawl rate drops because there are fewer hosts remaining to be crawled. Heritrix is, by default, configured to be reasonably polite in its crawling, to avoid overwhelming individual sites (or getting banned by angry sysadmins). Of course, this means that very large sites take a very long time, as you are only crawling one URL every few seconds. The relevant settings are in this section:
Between successive fetches to the same "queue" (host), Heritrix waits the amount of time it took to fetch the previous URL from that queue, multiplied by the delayFactor and bounded by minDelayMs and maxDelayMs (0.5 and 3 seconds respectively here). It is also configured above to respect a robots.txt Crawl-delay directive of up to 3 seconds. I'd generally consider this a reasonable configuration unless you have explicit permission to harvest a site more aggressively.
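For reference, these politeness settings live on the disposition processor bean in crawler-beans.cxml. A minimal sketch with values matching the limits described above (the specific numbers here are illustrative assumptions, not copied from the original question's configuration) would look roughly like this:

```xml
<!-- DISPOSITION: schedules the politeness delay before the next fetch
     from the same queue (host). Values are illustrative, chosen to match
     the limits described above, not the poster's actual configuration. -->
<bean id="disposition"
      class="org.archive.crawler.postprocessor.DispositionProcessor">
  <!-- wait delayFactor x (time taken by the previous fetch from this queue) -->
  <property name="delayFactor" value="5.0" />
  <!-- never wait less than 0.5 s between fetches from the same queue -->
  <property name="minDelayMs" value="500" />
  <!-- never wait more than 3 s, regardless of delayFactor -->
  <property name="maxDelayMs" value="3000" />
  <!-- honour a robots.txt Crawl-delay directive, but only up to 3 s -->
  <property name="respectCrawlDelayUpToSeconds" value="3" />
</bean>
```

Raising delayFactor or the delay bounds slows the crawl further; lowering them speeds it up at the cost of politeness, which is only advisable for sites you have permission to hit harder.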