-
Hi! I've been crawling with Heritrix for 5 hours and I've noticed that the state folder has grown very large: about 113GB, with 166,045,099 URLs currently queued. Is the size of the state folder expected, and do I need to keep it? Thank you!
-
Hi @cgr71ii,

From the information you gave, I can see your crawler currently has 166,045,099 URLs queued for download. This data, also called the crawl frontier, is what is taking up most of the crawler state folder. If I've got my maths right, 113GB / 166,045,099 = 680 bytes for each URL. This seems pretty reasonable to me, given various bits of metadata are also held along with the URL.

So, yes, this is what I'd expect to see, given the size of your crawl frontier.

Note that if checkpointing is being used, the state folders just get larger and larger because all previous versions of the frontier are kept by default. In this situation, you can delete older checkpoints manually.

HTH,
Andy
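If anyone wants to repeat this back-of-the-envelope estimate on their own crawl, here is a minimal Python sketch. The state path is a placeholder, and the queued-URL count is simply the figure from this thread; you would take both from your own job directory and frontier report.

```python
import os

def dir_size_bytes(path):
    """Total size of all files under path (e.g. the Heritrix state folder)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Hypothetical values: point this at your own job's state folder and take
# the queued-URL count from your crawl's frontier report.
state_path = "/heritrix/jobs/my-crawl/state"
queued_urls = 166_045_099

size = dir_size_bytes(state_path)
print(f"state folder: {size / 1e9:.1f} GB")
print(f"bytes per queued URL: {size / queued_urls:.0f}")
# With the figures from this thread: 113e9 / 166,045,099 ≈ 680 bytes per URL.
```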
-
The state directory is where Heritrix keeps the BDB databases tracking things like the state of the crawl, i.e. the set of URLs it has already seen (so it knows not to visit them again) and the queue of URLs to visit in future. 680 bytes per discovered URL doesn't seem unreasonable given the overheads of BDB and the generic serialization mechanism Heritrix uses. Looking at a recent large crawl here I'm seeing about 792 bytes per URL discovered.
Reducing the scope of the crawl should prevent the queue from growing so large. If you're really tight on space and can't reduce the scope, you could divide up your seeds and crawl them in separate jobs, deleting the state directory of the previous job before starting the next one. It would likely be a lot of work, but there are almost certainly ways the code could be modified to use more efficient serialization (the extreme end of that would be swapping out BDB for a more space-efficient database like RocksDB), although there'll be a hard limit eventually. My gut feeling is it'd be somewhere around 100 bytes per URL without losing information/features.
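If it helps, here is a very rough Python sketch of that split-the-seeds approach. The paths, chunk count, and job layout are all hypothetical, and actually launching each Heritrix job (web UI, REST API, scripts) is left out.

```python
from pathlib import Path
import shutil

# Hypothetical layout: one seeds file split into CHUNKS pieces, each crawled
# as its own Heritrix job under JOBS_DIR.
SEEDS_FILE = Path("seeds.txt")
JOBS_DIR = Path("/heritrix/jobs")
CHUNKS = 4

seeds = [s for s in SEEDS_FILE.read_text().splitlines() if s.strip()]
chunk_size = -(-len(seeds) // CHUNKS)  # ceiling division

for i in range(CHUNKS):
    chunk = seeds[i * chunk_size:(i + 1) * chunk_size]
    if not chunk:
        break
    job_dir = JOBS_DIR / f"crawl-part-{i}"
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "seeds.txt").write_text("\n".join(chunk) + "\n")

    # ... run the Heritrix job for this chunk to completion here ...

    # Once the job is finished (and its WARCs are safely stored), delete the
    # state directory to reclaim space before starting the next job.
    state_dir = job_dir / "state"
    if state_dir.exists():
        shutil.rmtree(state_dir)
```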
During the crawl, yes. You don't need to keep it after the crawl unless you're using the deduplication feature.
-
Thank you both for the explanation and the alternatives! :)