docs: Add 'Checkpointing' section to operating guide
Fixes #500 #506
ato committed Sep 30, 2022
1 parent 3c79a0b commit b328ded

Changes to ``docs/operating.rst``:

Note that the recovery journal format's 'F+' lines may include a
via the above mechanisms, but that this may not be a complete
representation of all URI state from its discovery in a normal crawl.

Checkpointing
-------------

Checkpointing a crawl job writes a representation of the job's current state under the ``checkpoints`` directory,
which can later be used to restart the job from that point.

Checkpointed state includes a serialization of the main crawl job objects, copies of the current set of bdbje
(BerkeleyDB-JE) log files, and other files that represent the state of the crawl. The checkpoint directory contains
everything required to recover the crawl. Checkpointing also rotates the crawl logs, including the ``recover.gz``
log if enabled. Log files are NOT copied to the checkpoint directory. They are left under the ``logs`` directory and
are distinguished by a suffix: the checkpoint name. For example, for checkpoint ``cp00001-20220930061713`` the crawl
log would be named ``crawl.log.cp00001-20220930061713``.
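
As a sketch of the resulting layout (the filenames below are hypothetical examples, not output copied from a real crawl), after two checkpoints the ``logs`` directory holds the active log plus one rotated copy per checkpoint name:

.. code-block:: shell

   # Simulate the layout described above (illustration only).
   mkdir -p logs
   touch logs/crawl.log \
         logs/crawl.log.cp00001-20220930061713 \
         logs/crawl.log.cp00002-20221001061713

   # Rotated logs are matched by their checkpoint-name suffix.
   ls logs/crawl.log.cp*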

To make checkpointing faster and reduce disk space usage, hard links are used, on filesystems that support them, to
collect the BerkeleyDB-JE files required to reproduce the crawler's state.
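
A hard link shares the same inode as the original file, so the checkpoint's copy consumes no extra data blocks. A minimal illustration (the paths and filename are hypothetical, not Heritrix's actual layout):

.. code-block:: shell

   # Create a stand-in for a bdbje log file and hard-link it into a
   # checkpoint directory, as Heritrix does where the filesystem allows.
   mkdir -p state checkpoints/cp00001
   echo "bdbje log data" > state/00000000.jdb
   ln state/00000000.jdb checkpoints/cp00001/00000000.jdb

   # test's -ef operator checks both paths refer to the same inode.
   [ state/00000000.jdb -ef checkpoints/cp00001/00000000.jdb ] && echo "hardlinked"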

To run a checkpoint, click the checkpoint button on the job page of the WUI, or invoke the checkpoint functionality
through the REST API. While the checkpoint is in progress, the crawl status will show as ``CHECKPOINTING``. When the
checkpoint has completed, the crawler resumes crawling, unless it was paused when the checkpoint was invoked, in
which case it re-enters the paused state.
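
For example, a checkpoint can be triggered by POSTing ``action=checkpoint`` to the job resource of the REST API. A sketch, assuming a job named ``myjob``, default ``admin:admin`` credentials, and Heritrix listening on its usual self-signed-TLS port 8443:

.. code-block:: shell

   # Hypothetical job name and credentials -- adjust for your deployment.
   JOB=myjob
   CMD="curl -k -u admin:admin --anyauth --location -d action=checkpoint https://localhost:8443/engine/job/$JOB"

   # Print the command; run it against a live Heritrix to trigger a checkpoint.
   echo "$CMD"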

Recovery from a checkpoint has much in common with recovery of a crawl using the frontier recovery journal
(``recover.gz``).

Automated Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~

To configure Heritrix to automatically run checkpoints periodically, set the
``checkpointService.checkpointIntervalMinutes`` property:

.. code-block:: xml

   <bean id="checkpointService" class="org.archive.crawler.framework.CheckpointService">
     <property name="checkpointIntervalMinutes" value="60"/>
     <!-- <property name="checkpointsDir" value="checkpoints"/> -->
     <!-- <property name="forgetAllButLatest" value="true"/> -->
   </bean>

By default only the latest checkpoint will be kept (``forgetAllButLatest``).

Restarting from a Checkpoint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The web UI provides an option to restart a crawl from a checkpoint:

1. Checkpoint the running crawl by clicking the "checkpoint" button.
2. When the checkpoint completes (a message will be displayed informing the operator of this event), terminate the
   crawl by clicking the "terminate" button.
3. Tear down the job by clicking the "teardown" button.
4. Rebuild the job by clicking the "build" button. At this point a dropdown box should appear under the command
   buttons, listing the names of the previously invoked checkpoints.
5. Select a checkpoint from the dropdown. The selected checkpoint will be used to start the newly built job.
6. Click the "launch" button.
7. Click the "unpause" button.

The job will now begin running from the chosen checkpoint.
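
The steps above can also be driven through the REST API. The sketch below only prints the sequence of calls rather than executing them; the job name, credentials, and especially the ``checkpoint`` parameter to ``action=launch`` are assumptions to verify against the REST API documentation for your Heritrix version:

.. code-block:: shell

   BASE=https://localhost:8443/engine/job/myjob   # hypothetical job URL
   CRED=admin:admin                               # hypothetical credentials

   # Print (rather than run) each REST call in the restart sequence.
   post() { echo curl -k -u "$CRED" --anyauth --location -d "action=$1${2:+&$2}" "$BASE"; }

   post checkpoint   # step 1: checkpoint, then wait for completion
   post terminate    # step 2: stop the crawl
   post teardown     # step 3: unload the job
   post build        # step 4: rebuild it
   # steps 5-6: the checkpoint parameter name is an assumption -- verify
   # against your Heritrix version's REST API documentation.
   post launch "checkpoint=cp00001-20220930061713"
   post unpause      # step 7: resume crawling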

Crawl Recovery
--------------
