I am currently downloading a site using Heritrix, and I'd rather not leave my computer on overnight. Can I simply stop a crawl and resume it later? Looking at https://heritrix.readthedocs.io/en/latest/operating.html#full-recovery, I gathered that if I were to 'accidentally' crash the Java program, I can put the …
Replies: 1 comment
The primary way to stop and later resume a crawl is to create a checkpoint (see wiki page). It seems I overlooked including this wiki page in the operating guide, sorry about that, I'll update it. (Edit: Done as of b328ded)

While the recovery log can be used in a pinch, I believe it's really intended as a fallback for when the crawler crashes or the crawl state becomes corrupted and no usable checkpoint is available. I'm not certain of the disadvantages of using the recovery log, but I'd guess that some crawl state might be incorrect (perhaps the statistics?), and for a large crawl, replaying the recovery log could take a lot longer than loading a snapshot would.
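For anyone who wants to script the checkpoint approach rather than click through the web UI, here is a rough sketch in Python. It is not taken from the wiki page. It assumes Heritrix 3's REST API is listening at the default `https://localhost:8443/engine` with digest authentication and a self-signed certificate, and that the action names (`pause`, `checkpoint`, `terminate`, `teardown`, `build`, `launch` with `checkpoint=latest`, `unpause`) match your Heritrix version, so check them against the REST API docs first. The job name `myjob`, the credentials, and the helper names `job_url`, `action_body` and `post_action` are made up for illustration.

```python
import ssl
import urllib.parse
import urllib.request

# Assumption: the default Heritrix engine address; change it to match your setup.
ENGINE_URL = "https://localhost:8443/engine"

# Assumed action order for stopping with a checkpoint and resuming from it.
STOP_ACTIONS = ["pause", "checkpoint", "terminate", "teardown"]
RESUME_ACTIONS = [("build", {}), ("launch", {"checkpoint": "latest"}), ("unpause", {})]


def job_url(job_name, engine_url=ENGINE_URL):
    """Return the REST endpoint for one crawl job."""
    return f"{engine_url}/job/{urllib.parse.quote(job_name)}"


def action_body(action, **params):
    """Encode an action and its optional parameters as a form-encoded POST body."""
    return urllib.parse.urlencode({"action": action, **params}).encode("ascii")


def post_action(job_name, action, user, password, **params):
    """POST one action to a job and return the HTTP status code."""
    url = job_url(job_name)
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, url, user, password)
    # Heritrix ships with a self-signed certificate, so verification is turned
    # off here. Only do this against a crawler you control.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context),
        urllib.request.HTTPDigestAuthHandler(passwords),
    )
    request = urllib.request.Request(
        url,
        data=action_body(action, **params),
        headers={"Accept": "application/xml"},
    )
    with opener.open(request, timeout=30) as response:
        return response.status


# Hypothetical usage (needs a running Heritrix instance):
#   for action in STOP_ACTIONS:
#       post_action("myjob", action, "admin", "admin")
#   ...later, after restarting Heritrix...
#   for action, params in RESUME_ACTIONS:
#       post_action("myjob", action, "admin", "admin", **params)
```

Pausing and checkpointing are not instantaneous, so a real script would probably need to wait for each step to finish (for example by polling the job status) before sending the next action.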