Write vocabulary files to separate directory #1237

The vocabulary files can now be written to (and read from) a different directory than the other index files. This directory can be specified in both `IndexBuilderMain` and `ServerMain` via the new command-line option `--vocabulary-basename` or `-v`. The directory for the index files is specified as before via `--index-basename` or `-i`. If no directory for the vocabulary files is specified, the directory of the index files is used (which was the behavior before this PR). This is useful for datasets with a huge vocabulary. For example, the vocabulary files for UniProt have a total size of over 3 TB (mostly due to `.vocabulary.external` and `.vocabulary.external.idsAndOffsets`).
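The fallback rule described above can be sketched as follows. This is only an illustration, not the code from this PR: the function name `resolveVocabularyBasename` and the use of `std::optional` for the unset option are assumptions.

```cpp
#include <optional>
#include <string>

// Hypothetical helper (not taken from the PR): returns the basename to use
// for the vocabulary files. If `--vocabulary-basename` was not given, the
// value of `--index-basename` is used, as before this PR.
std::string resolveVocabularyBasename(
    const std::string& indexBasename,
    const std::optional<std::string>& vocabularyBasename) {
  return vocabularyBasename.value_or(indexBasename);
}
```

With this sketch, passing `std::nullopt` as the second argument returns the index basename unchanged, while passing an explicit value returns that value.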

Reason: The merge was very SLOW when the `.tmp.partial-vocabulary.*` files were in the vocabulary directory, which for our UniProt index builds is on HDD (because the external vocabulary is so large). I first tried to have only the `.tmp.partial-vocabulary.words` files in the index directory, but that was still slow. Now the `.tmp.partial-vocabulary.ids` files are in the index directory as well.

Explanations concerning SLOW: The merging of the first few hundred million triples is fast (30 seconds per 100M triples). Then it becomes slow and then very slow (half an hour from 700M triples to 800M triples, which is about 60 times slower). Not only is the merge slow, but other work on the machine (like writing something in an editor with autosave on) also becomes very slow to respond, which is a clear sign that the random accesses to HDD are the problem.

NOTE: With the partial solution, where the `.tmp.partial-vocabulary.words` files are on SSD and the `.tmp.partial-vocabulary.ids` files are on HDD, it is not as bad. There was a very significant slow-down from 700M to 1100M triples, but after that merging was fast again (though not as fast as in the beginning). At the time of this writing, I have only observed the merge up to 1700M triples; stay tuned for more information.
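The resulting placement of the vocabulary-related files can be sketched as follows. This is a simplified illustration under two assumptions that are not confirmed by the text above: that a file path is formed by appending the suffix to the basename, and that the temporary files can be recognized by the prefix `.tmp.partial-vocabulary`. The function name `pathForVocabularyFile` is hypothetical.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not taken from the PR): chooses where a
// vocabulary-related file with the given suffix is placed.
// Temporary partial-vocabulary files go next to the index files (e.g. on
// SSD); all other vocabulary files go to the vocabulary basename (e.g. on HDD).
std::string pathForVocabularyFile(std::string_view suffix,
                                  const std::string& indexBasename,
                                  const std::string& vocabularyBasename) {
  constexpr std::string_view tmpPrefix = ".tmp.partial-vocabulary";
  // `substr` clamps the length, so this is safe for short suffixes.
  const bool isTemporary = suffix.substr(0, tmpPrefix.size()) == tmpPrefix;
  const std::string& base = isTemporary ? indexBasename : vocabularyBasename;
  return base + std::string(suffix);
}
```

In this sketch, both `.tmp.partial-vocabulary.words` and `.tmp.partial-vocabulary.ids` resolve to the index basename, while `.vocabulary.external` resolves to the vocabulary basename.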

Commits on Feb 2, 2024
