
Architecture:Hoover

Thibault François edited this page Jun 15, 2023 · 1 revision

Hoover in the Liquid bundle

The Liquid Investigations bundle includes Hoover (hoover-search, hoover-snoop, hoover-ui). It is pre-configured with a collection named uploads that is visible in Nextcloud (see Architecture:Nextcloud) and is indexed periodically by snoop.


Data Format

Collection data must be placed in a specific location for Hoover to pick up the files.

There are three types of data ingress supported for each collection:

  • data: Collection data
  • ocr: External OCR, matched by MD5, containing different "OCR Sources" (see below)
  • gpghome: A GPG-Home directory, populated with keys used to open content in the data folder

Here is an example of a directory listing of the Collections directory with three collections called archive, bug and testdata-6.

.
├── archive
│   └── data
│       └── testdata.zip
├── bug
│   └── data
│       └── uploads.zip
└── testdata-6
    ├── data
    │   └── some-files
    │       └── archives
    │           └── ...
    ├── gpghome
    │   ├── pubring.kbx
    │   ├── ...
    │   └── trustdb.gpg
    └── ocr
        ├── one
        │   └── foo
        │       └── bar
        │           └── f
        │               └── d
        │                   └── fd41b8f1fe19c151517b3cda2a615fa8.pdf
        └── two
            └── fd41b8f1fe19c151517b3cda2a615fa8.pdf.txt

In the above example, the testdata-6 collection has two OCR sources: one and two. Both external OCR files are shown on the page for the document with MD5 = fd41b8f1fe19c151517b3cda2a615fa8. OCR sources are added with the createocrsource command; see the External OCR section below.


Warning: Please make sure the original data is placed under a data directory inside the directory named after the collection.

--------------------------------------------------------
|   This is correct:        |   This is NOT correct:   |
|                           |                          |
|                           |                          |
|   .                       |   .                      |
|   └── collection3         |   └── collection3        |
|       └── data            |       └── data.zip       |
|           └── data.zip    |                          |
|                           |                          |
--------------------------------------------------------
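
The layout rule above can be checked before deploying. The following is a minimal sketch, not part of Hoover or the liquid tooling: find_missing_data_dirs is a hypothetical helper name, and it assumes the collections directory is passed as its only argument. It prints the name of every collection that has no data directory.

```shell
# Hypothetical helper (not part of liquid): list collections that lack
# a data/ subdirectory.
# Usage: find_missing_data_dirs COLLECTIONS_DIR
find_missing_data_dirs() {
    root="$1"
    for dir in "$root"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        if [ ! -d "$dir/data" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

For the incorrect layout shown in the table, running find_missing_data_dirs on the collections directory would print collection3.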

Example: Testdata

Set up the testdata collection. First download the data:

mkdir -p collections
git clone https://github.com/liquidinvestigations/testdata collections/testdata

Next define the collection in liquid.ini:

[collection:testdata]
process = True

Then let the deploy command pick up the new collection:

./liquid deploy

Adding collections

All collections are loaded from the liquid_collections directory configured in liquid.ini. The directories directly under liquid_collections cannot be symlinks, because the Docker container cannot follow a symlink that points outside of the mounted directory. Use bind mounts or NFS mounts instead.
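
To spot offending entries, something like the following sketch may help. find_symlinked_collections is a hypothetical helper name, and the sketch assumes a find implementation that supports -mindepth and -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper (not part of liquid): print entries directly under
# the collections directory that are symlinks. No output means there is
# nothing to fix.
# Usage: find_symlinked_collections COLLECTIONS_DIR
find_symlinked_collections() {
    find "$1" -mindepth 1 -maxdepth 1 -type l
}
```

Each path printed would need to be replaced with a real directory, a bind mount or an NFS mount.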

To add new collections simply append to the liquid.ini file:

[collection:always-changes]
process = True
sync = True

[collection:static-data]
process = True

... and run ./liquid deploy. The requested number of workers and their dependencies will be deployed on the Nomad cluster; see them run on the Nomad UI.


The two parameters control:

  • process: on/off switch for processing this collection, defaults to False.
  • sync: whether the workers should track the collection data and re-process changed or new documents.
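
Appending a section can also be scripted. This is only a sketch under stated assumptions: add_collection_section is a hypothetical helper, not a liquid command, and it assumes that writing process = True (plus sync = True on request) is all the section needs, as in the examples above.

```shell
# Hypothetical helper (not part of liquid): append a [collection:NAME]
# section to an ini file unless that section header already exists.
# Usage: add_collection_section INI_FILE NAME [sync]
add_collection_section() {
    ini="$1"
    name="$2"
    header="[collection:$name]"
    # -x matches whole lines, -F treats the brackets literally.
    if [ -f "$ini" ] && grep -qxF -- "$header" "$ini"; then
        echo "section $header already present" >&2
        return 0
    fi
    {
        printf '\n%s\n' "$header"
        printf 'process = True\n'
        # Only enable change tracking when explicitly requested.
        if [ "${3:-}" = "sync" ]; then
            printf 'sync = True\n'
        fi
    } >> "$ini"
}
```

For example, add_collection_section liquid.ini always-changes sync would append the first section shown above; ./liquid deploy still has to be run afterwards.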

Collection names must follow the Elasticsearch index naming rules: lowercase letters, numbers and dashes only.
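
A name can be checked against this character rule before it is written to liquid.ini. The sketch below uses a hypothetical helper, valid_collection_name, and checks only the characters listed above; Elasticsearch may impose further restrictions that are not covered here.

```shell
# Hypothetical helper (not part of liquid): succeed only if NAME is
# non-empty and consists of lowercase letters, digits and dashes.
# Usage: valid_collection_name NAME
valid_collection_name() (
    # Force byte-wise ranges so [a-z] never matches uppercase letters.
    LC_ALL=C
    case "$1" in
        ''|*[!a-z0-9-]*) exit 1 ;;
        *) exit 0 ;;
    esac
)
```

With this sketch, testdata-6 and static-data pass, while names such as Archive or my_data are rejected.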

Removing collections

In order to remove a collection, take the following steps:

  1. Remove the corresponding collection section from the liquid.ini file.
  2. Run ./liquid deploy
  3. Run ./liquid shell hoover:snoop ./manage.py purge (add the optional --force argument to skip manual confirmation).

OCR

Integrated Tesseract OCR

Use the collection's ocr_languages config value to set any number of languages for tesseract 4.0 LSTM.

After changing the ocr_languages setting for an already processed collection, please run:

 ./liquid dockerexec hoover:snoop ./manage.py retrytasks COLLECTION -- --func digests.launch

External OCR

Some datasets might come with OCR already processed. Hoover can import both TXT and PDF outputs from external OCR. Multiple batches or versions of OCR data may exist, so each OCR source is identified by a name, written in this section as SOURCE_NAME.

Place your OCR outputs under the directory collections/COLLECTION/ocr/SOURCE_NAME (as opposed to collections/COLLECTION/data where the original data is). The SOURCE_NAME will be the identifier for your external OCR source.

The OCR outputs:

  • can be placed at any depth inside the collections/COLLECTION/ocr/SOURCE_NAME directory,
  • must have filenames that start with the document MD5 and end with either the .pdf or the .txt extension.
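
The naming rule can be sketched as a small helper that derives an OCR output filename from the original document. ocr_filename is a hypothetical name, the sketch assumes GNU md5sum is available, and it assumes the lowercase hexadecimal digest (as in the listing in the Data Format section) is what Hoover matches on.

```shell
# Hypothetical helper (not part of liquid): print "<md5>.<ext>" for a
# document, suitable as an OCR output filename.
# Usage: ocr_filename ORIGINAL_FILE EXT   (EXT is pdf or txt)
ocr_filename() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    case "$2" in
        pdf|txt) ;;
        *) echo "extension must be pdf or txt" >&2; return 1 ;;
    esac
    # md5sum prints "<hash>  -" when reading stdin; keep only the hash.
    hash=$(md5sum < "$1" | cut -d ' ' -f 1)
    printf '%s.%s\n' "$hash" "$2"
}
```

The resulting file can then be placed at any depth under collections/COLLECTION/ocr/SOURCE_NAME.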

After writing the OCR outputs into a new OCR source, run the following commands:

./liquid dockerexec hoover:snoop ./manage.py createocrsource COLLECTION SOURCE_NAME
./liquid dockerexec hoover:snoop ./manage.py retrytasks COLLECTION --func ocr.walk_source

After adding files to an existing OCR source, you can re-walk the source directory by running:

./liquid dockerexec hoover:snoop ./manage.py retrytasks COLLECTION --func ocr.walk_source

This command will automatically re-process the collection documents to include the new external OCR data.
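
When several OCR sources are added at once, the two steps can be wrapped in one function. This is a sketch rather than an official tool: add_ocr_sources and the LIQUID variable are hypothetical names introduced here, and only the createocrsource and retrytasks invocations are taken from the commands above.

```shell
# Command used to reach the bundle. Override it for a dry run,
# for example LIQUID=echo prints the commands instead of running them.
LIQUID="${LIQUID:-./liquid}"

# Hypothetical helper (not part of liquid): register each OCR source,
# then walk the OCR sources once.
# Usage: add_ocr_sources COLLECTION SOURCE_NAME [SOURCE_NAME...]
add_ocr_sources() {
    collection="$1"
    shift
    for src in "$@"; do
        "$LIQUID" dockerexec hoover:snoop ./manage.py \
            createocrsource "$collection" "$src" || return 1
    done
    "$LIQUID" dockerexec hoover:snoop ./manage.py \
        retrytasks "$collection" --func ocr.walk_source
}
```

Setting LIQUID=echo first makes it possible to review the generated commands before running them against a real deployment.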

An example with testdata

You can test this feature on the testdata collection. Take a look at the file structure of the testdata repository first. After cloning testdata and configuring it in liquid.ini as described in the Testdata example above, run these commands:

./liquid dockerexec hoover:snoop ./manage.py createocrsource testdata one
./liquid dockerexec hoover:snoop ./manage.py createocrsource testdata two
./liquid dockerexec hoover:snoop ./manage.py retrytasks testdata --func ocr.walk_source

Then verify that both sources appear in the results for the document with MD5 = fd41b8f1fe19c151517b3cda2a615fa8 by searching the testdata collection with the query md5:fd41b8f1fe19c151517b3cda2a615fa8. You should see OCR versions for one and two, as well as for any Tesseract languages configured for the collection.

Re-processing and adding new data

You can trigger a manual re-walk of all directories (to find new and changed data) with:

./liquid dockerexec hoover:snoop ./manage.py retrytasks testdata --func filesystem.walk

You can trigger a manual retry for failed tasks with:

./liquid dockerexec hoover:snoop ./manage.py retrytasks testdata --status error --status broken