Write vocabulary files to separate directory #1237

Open
wants to merge 3 commits into
base: master

Commits on Feb 2, 2024

  1. Write vocabulary files to separate directory

    The vocabulary files can now be written to (and read from) a different
    directory than the other index files. This directory can be specified in
    both `IndexBuilderMain` and `ServerMain` via the new command-line option
    `--vocabulary-basename` or `-v`. The directory for the index files is
    specified as before via `--index-basename` or `-i`. If no directory
    for the vocabulary files is specified, the directory of the index
    files is used (the behavior before this PR).
    
    This is useful for datasets with a huge vocabulary. For example, the
    vocabulary files for UniProt have a total size of over 3 TB (mostly due
    to `.vocabulary.external` and `.vocabulary.external.idsAndOffsets`).
    Hannah Bast committed Feb 2, 2024
    d1ca3ed
  2. Remove log message used for debugging.

    Hannah Bast committed Feb 2, 2024
    dffc75a
  3. Have the .tmp files in the index directory

    Reason: The merge was very SLOW when these were in the vocabulary
    directory, which for our UniProt index builds is on HDD (because the
    external vocabulary is so large). I first tried to only have the
    `.tmp.partial-vocabulary.words` files in the index directory, but that
    was still slow. Now also the `.tmp.partial-vocabulary.ids` files are in
    the index directory.
    
    Explanations concerning SLOW: The merging of the first few hundred
    million triples is fast (30 seconds per 100M triples). Then it becomes
    slow and then very slow (half an hour from 700M triples to 800M
    triples). Not only is the merge slow, but other activity on the machine
    (like writing something in an editor with autosave on) also becomes
    very slow to respond, which is a clear sign that the random accesses to
    the HDD are the problem.
    
    NOTE: With the partial solution, where `.tmp.partial-vocabulary.words`
    are on SSD and `.tmp.partial-vocabulary.ids` are on HDD, it is not as
    bad. There was a very significant slow-down from 700M to 1100M triples,
    but after that merging was fast again (though not as fast as in the
    beginning). At the time of this writing, I have only observed the merge
    up to 1700M triples; stay tuned for more information.
    Hannah Bast committed Feb 2, 2024
    81f2e53
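
The file placement described in commits 1 and 3 can be sketched as follows. This is only an illustration in Python, not the code from this PR: the function `resolve_path`, the treatment of a basename as a plain path prefix, the example paths, and the `.index.pso` suffix used for "any other index file" are all assumptions made for the sketch. Only the `.vocabulary.*` and `.tmp.partial-vocabulary.*` suffixes are taken from the commit messages.

```python
from typing import Optional


def resolve_path(suffix: str,
                 index_basename: str,
                 vocabulary_basename: Optional[str] = None) -> str:
    """Return the path prefix plus suffix for one index-related file.

    Hypothetical helper: it only mirrors the placement rules described
    in the commit messages, not the actual implementation.
    """
    # Commit 1: without an explicit vocabulary basename, fall back to the
    # index basename (the behavior before the PR).
    vocab = vocabulary_basename or index_basename

    # Commit 3: temporary partial-vocabulary files stay with the index
    # files, so that the merge does not hit the (possibly HDD-backed)
    # vocabulary location.
    if suffix.startswith(".tmp."):
        return index_basename + suffix

    # Commit 1: the vocabulary files go to the vocabulary location.
    if suffix.startswith(".vocabulary"):
        return vocab + suffix

    # Everything else is treated as a regular index file.
    return index_basename + suffix


if __name__ == "__main__":
    # Assumed example layout: index on SSD, vocabulary on HDD.
    for s in (".vocabulary.external",
              ".tmp.partial-vocabulary.words",
              ".tmp.partial-vocabulary.ids",
              ".index.pso"):
        print(resolve_path(s, "/ssd/uniprot", "/hdd/uniprot"))
```

With this layout, only the large `.vocabulary.*` files end up under the HDD prefix, while the temporary files that are accessed randomly during the merge stay under the SSD prefix.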