-
Hi! I've been crawling with Heritrix for 5 hours and I've noticed that the state folder has grown very large: about 113GB, with 166,045,099 URLs currently queued. Is the size of the state folder expected, and do I need to keep it? Thank you!
-
Hi @cgr71ii,

From the information you gave, I can see your crawler currently has 166,045,099 URLs queued for download. This data, also called the crawl frontier, is what is taking up most of the crawler state folder. If I've got my maths right, 113GB / 166,045,099 = 680 bytes for each URL. This seems pretty reasonable to me, given various bits of metadata are also held along with the URL.

So, yes, this is what I'd expect to see, given the size of your crawl frontier.

Note that if checkpointing is being used, the state folders just get larger and larger because all previous versions of the frontier are kept by default. In this situation, you can delete older checkpoints manually.

HTH,
Andy
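If anyone wants to repeat this back-of-the-envelope estimate on their own crawl, here is a minimal Python sketch. The state path is a placeholder, and the queued-URL count is simply the figure from this thread; you would take both from your own job directory and frontier report.

```python
import os

def dir_size_bytes(path):
    """Total size of all files under path (e.g. the Heritrix state folder)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Hypothetical values: point this at your own job's state folder and take
# the queued-URL count from your crawl's frontier report.
state_path = "/heritrix/jobs/my-crawl/state"
queued_urls = 166_045_099

size = dir_size_bytes(state_path)
print(f"state folder: {size / 1e9:.1f} GB")
print(f"bytes per queued URL: {size / queued_urls:.0f}")
# With the figures from this thread: 113e9 / 166,045,099 ≈ 680 bytes per URL.
```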
-
The state directory is where Heritrix keeps the BDB databases tracking things like the state of the crawl, i.e. the set of URLs it has already seen (so it knows not to visit them again) and the queue of URLs to visit in future. 680 bytes per discovered URL doesn't seem unreasonable given the overheads of BDB and the generic serialization mechanism Heritrix uses. Looking at a recent large crawl here I'm seeing about 792 bytes per URL discovered.
Reducing the scope of the crawl should prevent the queue from growing so large. If you're really tight on space and can't reduce the scope, you could divide up your seeds and crawl them in separate jobs, deleting the state directory of the previous job before starting the next one. It would likely be a lot of work, but there are almost certainly ways the code could be modified to use more efficient serialization (the extreme end of that would be swapping out BDB for a more space-efficient database like RocksDB), although there'll be a hard limit eventually. My gut feeling is it'd be somewhere around 100 bytes per URL without losing information/features.
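If it helps, here is a very rough Python sketch of that split-the-seeds approach. The paths, chunk count, and job layout are all hypothetical, and actually launching each Heritrix job (web UI, REST API, scripts) is left out.

```python
from pathlib import Path
import shutil

# Hypothetical layout: one seeds file split into CHUNKS pieces, each crawled
# as its own Heritrix job under JOBS_DIR.
SEEDS_FILE = Path("seeds.txt")
JOBS_DIR = Path("/heritrix/jobs")
CHUNKS = 4

seeds = [s for s in SEEDS_FILE.read_text().splitlines() if s.strip()]
chunk_size = -(-len(seeds) // CHUNKS)  # ceiling division

for i in range(CHUNKS):
    chunk = seeds[i * chunk_size:(i + 1) * chunk_size]
    if not chunk:
        break
    job_dir = JOBS_DIR / f"crawl-part-{i}"
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "seeds.txt").write_text("\n".join(chunk) + "\n")

    # ... run the Heritrix job for this chunk to completion here ...

    # Once the job is finished (and its WARCs are safely stored), delete the
    # state directory to reclaim space before starting the next job.
    state_dir = job_dir / "state"
    if state_dir.exists():
        shutil.rmtree(state_dir)
```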
During the crawl, yes. You don't need to keep it after the crawl unless you're using the deduplication feature.
-
Thank you both for the explanation and the alternatives! :)