Skip to content
This repository was archived by the owner on Apr 4, 2023. It is now read-only.

Conversation

@loiclec
Copy link
Contributor

@loiclec loiclec commented Jan 31, 2023

Fixes partially #605

The words_prefix_position_docids can easily contain millions of entries. Thus, iterating
over it can be very expensive. But we do so needlessly for every document addition tasks.

It can sometimes cause indexing performance issues when :

  • a user sends many documentAdditionOrUpdate tasks that cannot be all batched together (for example if they are interspersed with documentDeletion tasks)
  • the documents contain long, diverse text fields, thus increasing the number of entries in words_prefix_position_docids
  • the index has accumulated many soft-deleted documents, further increasing the size of words_prefix_position_docids
  • the machine running Meilisearch does not have great IO performance (e.g. slow SSD, or quota-limited by the cloud provider)

Note, before approving the PR: the only changed file should be milli/src/update/words_prefix_position_docids.rs.

This database can easily contain millions of entries. Thus, iterating
over it can be very expensive.

For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words`
will always be empty. Thus, we can save a significant amount of time
by adding this `if !del_prefix_fst_words.is_empty()` condition.

The code's behaviour remains completely unchanged.
@loiclec loiclec added no breaking The related changes are not breaking (DB nor API) performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption labels Jan 31, 2023
Copy link
Contributor

@dureuill dureuill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, the diff seems fully self-contained and doesn't appear to require any additional context.

Like noted in the comments, if the if del_prefix_fst_words.is_empty(), it follows that if del_prefix_fst_words.contains(x) if false forall x.

If the only effect of the loop is to call del_current on some elements of the iterator (in particular, no side-effect to simply iterating over the word_prefix_postfix, except for performance), then this code is correct.

Two questions, but not related to this PR per-se:

  1. Why is the explicit drop needed? It looks to me like the iter will be dropped at the end of the block implicitly?
  2. Why is unsafe { iter.del_current()? }; sound? I see there is an unsafe block without a // SAFETY: block comment.

Both of these questions should not block this PR from my point of view. Thank you for your work, and the block comment to explain the logic is especially appreciated 😊 .

@loiclec
Copy link
Contributor Author

loiclec commented Jan 31, 2023

Yes, the drop(iter) is not needed. I kept it out of an abundance of caution.

Regarding the unsafe block, here is the documentation of del_current:

It is undefined behavior to keep a reference of a value from this database while modifying it.

Values returned from the database are valid only until a subsequent update operation, or the end of the transaction..

So I think it is unsafe because we still have access to the value prefix at this point, but we shouldn't touch it. I didn't change any of this code though :)

Thanks for your review Louis!

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me 💯 Thank you for the fix!

@loiclec
Copy link
Contributor Author

loiclec commented Jan 31, 2023

thank you! bors merge

@curquiza
Copy link
Member

bors merge

@bors
Copy link
Contributor

bors bot commented Jan 31, 2023

Build succeeded:

@bors bors bot merged commit 758b4ac into main Jan 31, 2023
@bors bors bot deleted the incremental-indexing-words-prefix-positions-docids branch January 31, 2023 16:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

no breaking The related changes are not breaking (DB nor API) performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants