How to resume a crawl for later #506

JenPho · 2022-09-10T10:04:42Z

JenPho
Sep 10, 2022

I am currently downloading a site using Heritrix, and I don't want to leave my computer on overnight. Can I simply stop a crawl and resume it later?

Looking at https://heritrix.readthedocs.io/en/latest/operating.html#full-recovery, I gathered that if I were to 'accidentally' crash the Java program, I could put the /jobs/x/date/logs/frontier.recover.gz file in /jobs/x/action, start the server, and launch the job again to resume it. Is this correct, or are crawls meant to be a do-it-all-right-now thing? I tried this and I'm not sure it worked: I ran kill PID on the server in my Terminal and relaunched it, and saw that it started scraping under a new directory and had moved my frontier.recover.gz file to /jobs/x/action/done.
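The staging step described above can be sketched in Python. This is only an illustration of the copy as I described it, not a confirmed recovery procedure: stage_recovery_log is a hypothetical helper, and the directory layout (a launch directory containing logs/frontier.recover.gz, plus an action directory under the job) is taken from the paths in this post.

```python
import shutil
from pathlib import Path


def stage_recovery_log(job_dir, launch_id):
    """Copy a previous launch's recovery log into the job's action directory.

    job_dir   -- the job directory, e.g. /jobs/x (layout assumed from the post)
    launch_id -- name of the earlier launch directory (the 'date' part)
    """
    job_dir = Path(job_dir)
    source = job_dir / launch_id / "logs" / "frontier.recover.gz"
    if not source.is_file():
        raise FileNotFoundError(source)

    action_dir = job_dir / "action"
    action_dir.mkdir(exist_ok=True)

    # Copy rather than move, so the original log stays with its launch.
    target = action_dir / source.name
    shutil.copy2(source, target)
    return target
```

Copying instead of moving is a deliberate choice here: the file placed in action was later moved to action/done in my case, so the original stays available if the attempt needs repeating.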

Answered by ato

Sep 30, 2022


ato · 2022-09-30T01:12:07Z

ato
Sep 30, 2022
Maintainer

The primary way to stop and later resume a crawl is by creating a checkpoint (see wiki page). It seems I overlooked including this wiki page in the operating guide. Sorry about that; I'll update it. (Edit: Done as of b328ded)

While the recovery log can be used in a pinch, I believe it's really intended as a fallback option for when the crawler crashes or the crawl state becomes corrupted and no usable checkpoint is available. I'm not certain of the disadvantages of using the recovery log, but I'd guess that some crawl state might be incorrect (perhaps the statistics?), and for a large crawl, replaying the recovery log could take a lot more time than loading a snapshot would.
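For anyone who wants to script the checkpoint route, here is a minimal Python sketch. It rests on assumptions that are not stated above: that the Heritrix REST API is listening at https://localhost:8443/engine with digest authentication and a self-signed certificate, and that the job endpoint accepts the form actions pause, checkpoint, and launch (the last with an optional checkpoint parameter). Please check these against the REST API documentation for your version. build_action_request and send are hypothetical helper names, and "myjob" is a placeholder job name.

```python
import ssl
import urllib.parse
import urllib.request

# Assumed default engine URL; adjust to match your installation.
BASE_URL = "https://localhost:8443/engine"


def build_action_request(job, action, **params):
    """Build (but do not send) a POST request for a job action."""
    fields = {"action": action, **params}
    data = urllib.parse.urlencode(fields).encode("ascii")
    url = f"{BASE_URL}/job/{urllib.parse.quote(job)}"
    return urllib.request.Request(
        url, data=data, headers={"Accept": "application/xml"}, method="POST"
    )


def send(request, user, password):
    """Send a request using digest auth; returns the HTTP status code."""
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, BASE_URL, user, password)

    # Assumption: the server uses a self-signed certificate, so
    # verification is disabled. Only do this for a local instance.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(passwords),
        urllib.request.HTTPSHandler(context=context),
    )
    with opener.open(request) as response:
        return response.status


# Requests one might send before shutting down for the night:
pause = build_action_request("myjob", "pause")
checkpoint = build_action_request("myjob", "checkpoint")

# Request one might send the next day to resume from the newest checkpoint:
resume = build_action_request("myjob", "launch", checkpoint="latest")
```

Under those assumptions, stopping for the night would mean sending pause and then checkpoint before shutting Heritrix down, and resuming would mean sending the launch request with checkpoint set to latest once the server is back up. Building the requests separately from sending them makes it easy to inspect what would be posted before touching a running crawl.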
