Reduce incremental indexing time of `words_prefix_position_docids` DB #776

loiclec · 2023-01-31T10:57:42Z

Partially fixes #605

The `words_prefix_position_docids` database can easily contain millions of entries, so iterating
over it can be very expensive. Yet we do so needlessly for every document addition task.

This can cause indexing performance issues when:

- a user sends many `documentAdditionOrUpdate` tasks that cannot all be batched together (for example, if they are interspersed with `documentDeletion` tasks)
- the documents contain long, diverse text fields, which increases the number of entries in `words_prefix_position_docids`
- the index has accumulated many soft-deleted documents, which further increases the size of `words_prefix_position_docids`
- the machine running Meilisearch does not have great IO performance (e.g. a slow SSD, or IO that is quota-limited by the cloud provider)

Note, before approving the PR: the only changed file should be milli/src/update/words_prefix_position_docids.rs.

This database can easily contain millions of entries. Thus, iterating over it can be very expensive. For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words` will always be empty. Thus, we can save a significant amount of time by adding this `if !del_prefix_fst_words.is_empty()` condition. The code's behaviour remains completely unchanged.
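The early-exit guard described above can be illustrated with a minimal sketch. It is not the code from `milli/src/update/words_prefix_position_docids.rs`: a `BTreeMap` stands in for the LMDB database, a `BTreeSet` stands in for the `del_prefix_fst_words` FST, and the names `Db` and `delete_removed_prefixes` are hypothetical. The returned count only serves to make the saved work visible.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Assumption: (prefix, position) -> docids, a stand-in for the real database.
type Db = BTreeMap<(String, u32), Vec<u32>>;

/// Removes every entry whose prefix is in `del_prefixes`.
/// Returns the number of entries that had to be visited.
fn delete_removed_prefixes(db: &mut Db, del_prefixes: &BTreeSet<String>) -> usize {
    // If no prefix was deleted, `contains` would be false for every entry,
    // so the full scan below could not delete anything. Skip it entirely.
    if del_prefixes.is_empty() {
        return 0;
    }
    let mut visited = 0;
    db.retain(|(prefix, _position), _docids| {
        visited += 1;
        !del_prefixes.contains(prefix)
    });
    visited
}

fn main() {
    let mut db = Db::new();
    db.insert(("he".to_string(), 0), vec![1, 2]);
    db.insert(("he".to_string(), 1), vec![3]);
    db.insert(("wo".to_string(), 0), vec![2]);

    // Regular document addition: nothing to delete, nothing is visited.
    let visited = delete_removed_prefixes(&mut db, &BTreeSet::new());
    println!("visited {visited} entries, {} remain", db.len());

    // A prefix was removed: the full scan is still performed.
    let mut deleted = BTreeSet::new();
    deleted.insert("he".to_string());
    let visited = delete_removed_prefixes(&mut db, &deleted);
    println!("visited {visited} entries, {} remain", db.len());
}
```

In both cases the resulting map is the same as it would be without the guard; only the number of visited entries differs.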

dureuill

OK, the diff seems fully self-contained and doesn't appear to require any additional context.

As noted in the comments, if `del_prefix_fst_words.is_empty()`, it follows that `del_prefix_fst_words.contains(x)` is false for all `x`.

If the only effect of the loop is to call del_current on some elements of the iterator (in particular, no side-effect to simply iterating over the word_prefix_postfix, except for performance), then this code is correct.

Two questions, not related to this PR per se:

- Why is the explicit `drop` needed? It looks to me like the `iter` will be dropped implicitly at the end of the block.
- Why is `unsafe { iter.del_current()? };` sound? I see there is an `unsafe` block without a `// SAFETY:` comment.

Neither of these questions should block this PR from my point of view. Thank you for your work, and the block comment explaining the logic is especially appreciated 😊.

loiclec · 2023-01-31T11:22:31Z

Yes, the drop(iter) is not needed. I kept it out of an abundance of caution.
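For readers unfamiliar with Rust's drop rules, here is a small sketch of why the explicit `drop` is redundant. The `Iter` type below is a hypothetical stand-in that only records when it is dropped; it is not the heed iterator used in the PR.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a database iterator; logs its own drop.
struct Iter {
    name: &'static str,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for Iter {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("dropped {}", self.name));
    }
}

fn drop_order() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let explicit = Iter { name: "explicit", log: Rc::clone(&log) };
        // Explicit drop: the value is released right here.
        drop(explicit);
        log.borrow_mut().push("after explicit drop".to_string());
    }
    {
        let _implicit = Iter { name: "implicit", log: Rc::clone(&log) };
        log.borrow_mut().push("end of block reached".to_string());
        // No drop call: the value is released when the block ends.
    }
    let result = log.borrow().clone();
    result
}

fn main() {
    for line in drop_order() {
        println!("{line}");
    }
}
```

Both values are released by the time their block ends; the explicit call only moves the release earlier within the block.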

Regarding the unsafe block, here is the documentation of del_current:

> It is undefined behavior to keep a reference of a value from this database while modifying it.
>
> Values returned from the database are valid only until a subsequent update operation, or the end of the transaction.

So I think it is unsafe because we still have access to the value prefix at this point, but we shouldn't touch it. I didn't change any of this code though :)

Thanks for your review Louis!

Kerollmops

It looks good to me 💯 Thank you for the fix!

loiclec · 2023-01-31T14:57:12Z

thank you! bors merge

curquiza · 2023-01-31T15:52:11Z

bors merge

bors · 2023-01-31T16:11:02Z

Build succeeded:

loiclec added the `no breaking` and `performance` labels Jan 31, 2023

loiclec requested review from Kerollmops, ManyTheFish and dureuill January 31, 2023 10:57

curquiza mentioned this pull request Jan 31, 2023

Reduce incremental indexing time of words_prefix_position_docids DB meilisearch/meilisearch#3438

Closed

dureuill approved these changes Jan 31, 2023

Kerollmops approved these changes Jan 31, 2023

bors bot merged commit 758b4ac into main Jan 31, 2023

bors bot deleted the incremental-indexing-words-prefix-positions-docids branch January 31, 2023 16:11

loiclec mentioned this pull request Feb 1, 2023

Routine indexing tasks complete without adding documents, then task hangs indefinitely meilisearch/meilisearch#3349

Closed
