Optimise facets indexing #590
Conversation
Force-pushed from a2c8bdf to 5728619
Note: I'm putting this PR back into draft status, as I don't think the improvement is good enough. The problem is that after adding just one document, we need to iterate over the whole database and perform a ton of bitmap operations, which is very slow. Ideally, the algorithm should be incremental and modify the database in place instead of clearing it and rebuilding it from scratch. I am starting to think about what a proper solution would look like and will update this PR later.
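The rebuild-versus-incremental trade-off described above can be sketched with a toy facet index. This is not the real milli code (the actual databases are LMDB trees of roaring bitmaps behind codecs); all names, types, and the `Vec<u32>` stand-in for a bitmap are hypothetical:

```rust
use std::collections::BTreeMap;

// Toy level-0 facet database: facet value -> sorted document ids.
// (Hypothetical stand-in for the real bitmap-valued LMDB tree.)
type Level0 = BTreeMap<u32, Vec<u32>>;

// Full rebuild: recompute every level-1 group by iterating the whole
// level-0 tree and unioning each chunk -- this is the slow path the
// comment describes, paid even when a single document was added.
fn rebuild_level1(level0: &Level0, group_size: usize) -> BTreeMap<u32, Vec<u32>> {
    let mut level1 = BTreeMap::new();
    let entries: Vec<_> = level0.iter().collect();
    for chunk in entries.chunks(group_size) {
        // Each group is keyed by the first facet value it covers.
        let left_bound = *chunk[0].0;
        let mut union: Vec<u32> = Vec::new();
        for (_, docids) in chunk {
            union.extend(docids.iter().copied());
        }
        union.sort_unstable();
        union.dedup();
        level1.insert(left_bound, union);
    }
    level1
}

// Incremental update: touch only the one group that covers the new value,
// leaving the rest of the database in place.
fn insert_incremental(
    level0: &mut Level0,
    level1: &mut BTreeMap<u32, Vec<u32>>,
    value: u32,
    docid: u32,
) {
    level0.entry(value).or_default().push(docid);
    // Find the group whose left bound is <= value and add the docid to it.
    if let Some((_, docids)) = level1.range_mut(..=value).next_back() {
        if !docids.contains(&docid) {
            docids.push(docid);
            docids.sort_unstable();
        }
    }
}
```

The sketch glosses over group splitting and higher levels, but it shows why the incremental path only pays for one group lookup instead of a full scan.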
After looking a bit more into it, I realised there is a problem with the current structure of […]. As a result, adding an element at level 0 automatically invalidates all its parents and its parents' successors. That means the minimum complexity of adding an element is […].

I think this could be solved by using strings directly instead of indices as keys, like this: […]. But as explained in […], I suspect it would make it difficult to use and/or create this structure.
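The index-versus-bound keying issue raised in the comment can be illustrated with a minimal sketch. With index-based keys, a parent at `(level, i)` covers children `i*G .. i*G+G`, so inserting one level-0 element shifts every later index and invalidates all parents to its right. With left-bound keys, existing parent keys stay valid and only the one covering group changes. The key shape and function below are hypothetical, not the real milli codec:

```rust
use std::collections::BTreeMap;

// Hypothetical bound-based parent key: (level, left bound of covered range).
// The real codec would also carry the right bound and group size.
type BoundKey = (u8, String);
type Groups = BTreeMap<BoundKey, Vec<String>>;

// Add a level-0 value to the group on `level` whose left bound precedes it.
// No existing parent key needs rewriting, because keys are bounds, not indices.
fn add_to_group(groups: &mut Groups, level: u8, value: &str) {
    let key = groups
        .range(..=(level, value.to_string()))
        .next_back()
        .map(|(k, _)| k.clone());
    match key {
        // A preceding group on the same level covers this value: extend it.
        Some(k) if k.0 == level => groups.get_mut(&k).unwrap().push(value.to_string()),
        // No preceding group: start a new one with this value as left bound.
        _ => {
            groups.insert((level, value.to_string()), vec![value.to_string()]);
        }
    }
}
```

For example, inserting "banana" between "apple" and "cherry" only mutates the group keyed by "apple"; the key set itself is untouched, which is what makes an in-place incremental algorithm feasible.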
After thinking about it more, I think this PR should be reviewed as is for now, because: […]
Write the algorithm once for both numbers and strings
Force-pushed from 5728619 to 3968790
Force-pushed from 6e0fec6 to 6cc9757
Kerollmops left a comment:
Thank you very much!
bors merge
Build succeeded.
Pull Request
What does this PR do?
Fixes #589
Notes
I added documentation for the whole module which attempts to explain the shape of the databases and their purpose. However, I realise there is already some documentation about this, so I am not sure if we want to keep it.
Benchmarks
We get a ~1.15x speed-up on the geo_point benchmark.