Architecture:Hoover
The Liquid Investigations bundle includes Hoover (hoover-search, hoover-snoop, hoover-ui). It's pre-configured with a collection named uploads that is visible in Architecture:Nextcloud and indexed periodically by snoop.
Collection data must be placed in a specific location for Hoover to pick up the files.
There are three types of data ingress supported for each collection:
- data: Collection data
- ocr: External OCR, matched by MD5, containing different "OCR Sources" (see below)
- gpghome: A GPG home directory, populated with keys used to open content in the data folder
Here is an example directory listing of the collections directory with three collections called archive, bug and testdata-6.
.
├── archive
│   └── data
│       └── testdata.zip
├── bug
│   └── data
│       └── uploads.zip
└── testdata-6
    ├── data
    │   └── some-files
    │       └── archives
    │           └── ...
    ├── gpghome
    │   ├── pubring.kbx
    │   ├── ...
    │   └── trustdb.gpg
    └── ocr
        ├── one
        │   └── foo
        │       └── bar
        │           └── f
        │               └── d
        │                   └── fd41b8f1fe19c151517b3cda2a615fa8.pdf
        └── two
            └── fd41b8f1fe19c151517b3cda2a615fa8.pdf.txt
In the above example, the testdata-6 collection has two OCR sources: one and two. Both external OCR files are shown on the page for the document with MD5 = fd41b8f1fe19c151517b3cda2a615fa8. OCR sources are added with the createocrsource command; see the section below.
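Because external OCR files are matched by MD5, it can help to compute the digest of an original document and derive the matching OCR filename from it. The following is a minimal sketch and not part of Hoover: the helper names `file_md5` and `ocr_target` are hypothetical, and it assumes the MD5 shown by Hoover is the digest of the original file's bytes.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ocr_target(collection_dir: Path, source_name: str, document: Path,
               extension: str = ".txt") -> Path:
    """Build a path under ocr/SOURCE_NAME whose filename starts with the document MD5.

    Assumption: the MD5 is computed over the bytes of the original document.
    """
    if extension not in (".pdf", ".txt"):
        raise ValueError("external OCR outputs must end in .pdf or .txt")
    return collection_dir / "ocr" / source_name / (file_md5(document) + extension)
```

For example, `ocr_target(Path("collections/testdata-6"), "two", some_document)` returns a path directly inside `collections/testdata-6/ocr/two`; the helper only computes the path and does not write anything.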
Warning: Make sure the original data is placed under a data directory inside the collection's directory.
--------------------------------------------------------
| This is correct:         | This is NOT correct:      |
|                          |                           |
| .                        | .                         |
| └── collection3          | └── collection3           |
|     └── data             |     └── data.zip          |
|         └── data.zip     |                           |
--------------------------------------------------------
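The layout rule from the warning can be checked before deploying. The following is a minimal sketch, not part of Liquid or Hoover: the function name `collections_missing_data` is hypothetical, and it assumes that every directory directly under the collections root is a collection.

```python
from pathlib import Path
from typing import List


def collections_missing_data(collections_root: Path) -> List[str]:
    """Return names of collection directories that have no `data` subdirectory."""
    missing = []
    for entry in sorted(collections_root.iterdir()):
        # Assumption: each directory directly under the root is one collection.
        if entry.is_dir() and not (entry / "data").is_dir():
            missing.append(entry.name)
    return missing
```

An empty result means every collection directory has its own `data` directory.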
Set up the testdata collection. First download the data:
mkdir -p collections
git clone https://github.com/liquidinvestigations/testdata collections/testdata
Next define the collection in liquid.ini:
[collection:testdata]
process = True
Then let the deploy command pick up the new collection:
./liquid deploy
All collections are loaded from the liquid_collections directory configured in liquid.ini.
The directories directly under liquid_collections can NOT be symlinks: the Docker container only sees the mounted directory, so a symlink that points outside of it cannot be resolved from inside the container. Instead of symlinks, use bind mounts or NFS mounts.
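One way to spot offending entries is to list the symlinks directly under the collections directory. This is a small sketch, not part of the Liquid tooling; `top_level_symlinks` is a hypothetical helper name, and the path passed to it is whatever liquid_collections points to in your liquid.ini.

```python
from pathlib import Path
from typing import List


def top_level_symlinks(collections_root: Path) -> List[str]:
    """Return the names of entries directly under collections_root that are symlinks."""
    return sorted(
        entry.name for entry in collections_root.iterdir() if entry.is_symlink()
    )
```

Any name it returns would need to be replaced by a bind mount or NFS mount, as described above.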
To add new collections, simply append to the liquid.ini file:
[collection:always-changes]
process = True
sync = True
[collection:static-data]
process = True
... and run ./liquid deploy. The requested number of workers and their dependencies will be deployed on the Nomad cluster; you can watch them run in the Nomad UI.
The two parameters control:
- process: on/off switch for processing this collection; defaults to False.
- sync: whether the workers should track the collection data and re-process changed or new documents.
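To review which collections a configuration defines, the collection sections can be read with Python's standard configparser module. This is only an illustrative sketch: `read_collections` is a hypothetical helper, it is not how Liquid itself loads the file, and the False default used here for sync is an assumption (the text above only states the default for process).

```python
import configparser
from typing import Dict


def read_collections(ini_text: str) -> Dict[str, Dict[str, bool]]:
    """Map each [collection:NAME] section to its process/sync flags."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    collections = {}
    for section in parser.sections():
        if not section.startswith("collection:"):
            continue
        name = section.split(":", 1)[1]
        collections[name] = {
            # process defaults to False, as documented above.
            "process": parser.getboolean(section, "process", fallback=False),
            # Assumption: sync is treated as False when it is not set.
            "sync": parser.getboolean(section, "sync", fallback=False),
        }
    return collections
```

Called on the two sections shown above, it reports always-changes with both flags on, and static-data with process on and sync off.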
Collection names must follow the Elasticsearch index naming rules: lowercase letters, numbers and dashes only.
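A name can be checked against this rule before it is added to liquid.ini. The sketch below uses a hypothetical helper, `is_valid_collection_name`; besides the character set stated above, it also rejects a leading dash, because Elasticsearch does not allow index names that start with "-".

```python
import re

# Lowercase letters, digits and dashes only; the first character must not be a dash.
_COLLECTION_NAME_RE = re.compile(r"[a-z0-9][a-z0-9-]*")


def is_valid_collection_name(name: str) -> bool:
    """Return True if the name uses only lowercase letters, digits and dashes."""
    return _COLLECTION_NAME_RE.fullmatch(name) is not None
```

With this check, `testdata-6` is accepted, while `Archive` and `my_data` are rejected.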
To remove a collection, take the following steps:
- Remove the corresponding collection section from the liquid.ini file.
- Run ./liquid deploy
- Run ./liquid shell hoover:snoop ./manage.py purge (use the optional argument --force to skip manual confirmation).
Use the collection's ocr_languages config value to set any number of languages for Tesseract 4.0 LSTM.
After changing the ocr_languages setting for an already processed collection, run:
./liquid dockerexec hoover:snoop ./manage.py retrytasks COLLECTION -- --func digests.launch
Some datasets might come with OCR already processed. Hoover can import both TXT and PDF outputs from external OCR.
Multiple batches or versions of OCR data may exist, so each OCR source is identified by a name, written in this section as SOURCE_NAME.
Place your OCR outputs under the directory collections/COLLECTION/ocr/SOURCE_NAME (as opposed to collections/COLLECTION/data, where the original data is).
The SOURCE_NAME will be the identifier for your external OCR source.
The OCR outputs:
- can be placed at any depth inside the collections/COLLECTION/ocr/SOURCE_NAME directory,
- must have a filename that starts with the document MD5 and ends with either the .pdf or the .txt extension.
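The two rules above can be checked for a whole OCR source directory before it is registered. This is a sketch under stated assumptions, not Hoover's own matching code: `check_ocr_source` is a hypothetical helper, and it assumes the MD5 prefix is written as 32 lowercase hexadecimal characters.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Assumption: the filename starts with a 32-character lowercase hex MD5.
_OCR_NAME_RE = re.compile(r"^[0-9a-f]{32}.*\.(pdf|txt)$")


def check_ocr_source(source_dir: Path) -> Tuple[List[Path], List[Path]]:
    """Split files under source_dir (at any depth) into accepted and rejected names."""
    accepted, rejected = [], []
    for path in sorted(source_dir.rglob("*")):
        if not path.is_file():
            continue
        if _OCR_NAME_RE.match(path.name):
            accepted.append(path)
        else:
            rejected.append(path)
    return accepted, rejected
```

Both `fd41b8f1fe19c151517b3cda2a615fa8.pdf` and `fd41b8f1fe19c151517b3cda2a615fa8.pdf.txt` pass this check, while a file such as `notes.txt` lands in the rejected list.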
After writing the OCR outputs into a new OCR source directory, run the following commands, where SOURCE is the SOURCE_NAME of that directory:
./liquid dockerexec hoover:snoop ./manage.py createocrsource COLLECTION SOURCE
./liquid dockerexec hoover:snoop ./manage.py retrytasks COLLECTION --func ocr.walk_source
After adding files to an existing OCR source, you can re-walk the source directory by running:
./liquid dockerexec hoover:snoop ./manage.py retrytasks COLLECTION --func ocr.walk_source
This command will automatically re-process the collection documents to include the new external OCR data.
You can test out this feature on the testdata collection. Take a look at the file structure here. After cloning testdata and configuring it in liquid.ini as instructed above, run these commands:
./liquid dockerexec hoover:snoop ./manage.py createocrsource testdata one
./liquid dockerexec hoover:snoop ./manage.py createocrsource testdata two
./liquid dockerexec hoover:snoop ./manage.py retrytasks testdata --func ocr.walk_source
Then verify that they appear in the results for the document with MD5 = fd41b8f1fe19c151517b3cda2a615fa8 by searching the testdata collection with the query md5:fd41b8f1fe19c151517b3cda2a615fa8. You should see OCR versions for one and two, as well as any Tesseract languages configured for the collection.
You can trigger a manual re-walk of all directories (to find new and changed data) with:
./liquid dockerexec hoover:snoop ./manage.py retrytasks testdata --func filesystem.walk
You can trigger a manual retry for failed tasks with:
./liquid dockerexec hoover:snoop ./manage.py retrytasks testdata --status error --status broken
Report incomplete documentation by opening a new Issue in this repository.