-
Notifications
You must be signed in to change notification settings - Fork 83
Reduce incremental indexing time of words_prefix_position_docids DB
#776
Conversation
This database can easily contain millions of entries. Thus, iterating over it can be very expensive. For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words` will always be empty. Thus, we can save a significant amount of time by adding this `if !del_prefix_fst_words.is_empty()` condition. The code's behaviour remains completely unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, the diff seems fully self-contained and doesn't appear to require any additional context.
Like noted in the comments, if the if del_prefix_fst_words.is_empty(), it follows that if del_prefix_fst_words.contains(x) if false forall x.
If the only effect of the loop is to call del_current on some elements of the iterator (in particular, no side-effect to simply iterating over the word_prefix_postfix, except for performance), then this code is correct.
Two questions, but not related to this PR per-se:
- Why is the explicit drop needed? It looks to me like the iter will be dropped at the end of the block implicitly?
- Why is
unsafe { iter.del_current()? };sound? I see there is anunsafeblock without a// SAFETY:block comment.
Both of these questions should not block this PR from my point of view. Thank you for your work, and the block comment to explain the logic is especially appreciated 😊 .
|
Yes, the Regarding the unsafe block, here is the documentation of
So I think it is unsafe because we still have access to the value Thanks for your review Louis! |
Kerollmops
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me 💯 Thank you for the fix!
|
thank you! bors merge |
|
bors merge |
|
Build succeeded:
|
Fixes partially #605
The
words_prefix_position_docidscan easily contain millions of entries. Thus, iteratingover it can be very expensive. But we do so needlessly for every document addition tasks.
It can sometimes cause indexing performance issues when :
documentAdditionOrUpdatetasks that cannot be all batched together (for example if they are interspersed withdocumentDeletiontasks)words_prefix_position_docidswords_prefix_position_docidsNote, before approving the PR: the only changed file should be
milli/src/update/words_prefix_position_docids.rs.