Sparse index: optional skip list on top of doc values #13449

iverase · 2024-06-04T10:20:59Z

Speaking to Adrien about what a sparse index would look like in Lucene, he suggested that sparse indexing does not need to be a new format but an additional responsibility of DocValuesFormat.

The idea is to add an option to add a skip list on top of doc values and to expose it via the DocValuesSkipper abstraction, which has an API that is similar to what we're doing for impacts today. This provides a very simple index which can be very efficient when the index is sorted and the field belongs to the index sorting.

In order to implement it, we added a new flag in FieldType.java that configures whether to create a "skip index" for doc values. This flag may only be set on doc values of type NUMERIC, SORTED_NUMERIC, SORTED and SORTED_SET. Attempting to index any other type of doc values with the flag set results in an exception.
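As an illustration of the rule above, here is a minimal, self-contained sketch of the check. It does not use Lucene classes: `DvType`, `SkipIndexFlagCheck` and `validate` are hypothetical names, and the choice of `IllegalArgumentException` is an assumption, since the description only says that an exception is thrown.

```java
import java.util.EnumSet;
import java.util.Set;

final class SkipIndexFlagCheck {
  /** Hypothetical stand-in for the doc values types; not Lucene's own enum. */
  enum DvType { NONE, NUMERIC, BINARY, SORTED, SORTED_NUMERIC, SORTED_SET }

  // Types that the description says may carry a skip index.
  static final Set<DvType> SUPPORTED =
      EnumSet.of(DvType.NUMERIC, DvType.SORTED_NUMERIC, DvType.SORTED, DvType.SORTED_SET);

  /** Throws if the skip-index flag is set on an unsupported doc values type. */
  static void validate(String field, DvType type, boolean skipIndex) {
    if (skipIndex && !SUPPORTED.contains(type)) {
      // Exception type is an assumption made for this sketch.
      throw new IllegalArgumentException(
          "field '" + field + "' cannot have a skip index with doc values type " + type);
    }
  }

  public static void main(String[] args) {
    validate("modified", DvType.SORTED_NUMERIC, true); // accepted
    validate("payload", DvType.BINARY, false);         // accepted: flag not set
    try {
      validate("payload", DvType.BINARY, true);        // rejected
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```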

This flag needs to be persisted in the FieldInfosFormat. This does not require a format change, as Lucene94FieldInfosFormat has some unused bit flags that we can use.

We have changed the DocValuesFormat to generate the "skip index" whenever the flag is set. For this first implementation we went for the most basic approach, which consists of a skip list with just one level. In this level we collect document statistics every 4096 documents and write them into the index. This basic structure already provides interesting numbers. I discussed with Adrien that, as a follow-up, we should introduce more levels to the skip list and optimise the index for low-cardinality fields.
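To make the one-level structure concrete, the following sketch collects per-interval statistics over a single-valued field. It is not the code in Lucene90DocValuesConsumer: the class, record and method names are hypothetical, and the set of statistics kept per interval is an assumption loosely based on the fields that appear in the review comments (min/max value, doc count, max doc ID).

```java
import java.util.ArrayList;
import java.util.List;

final class SingleLevelSkipIndexSketch {
  static final int INTERVAL = 4096; // documents per interval, as in the description

  /** Statistics for one interval of documents that have a value (hypothetical shape). */
  record Interval(int minDocID, int maxDocID, long minValue, long maxValue, int docCount) {}

  /** docIDs must be increasing; values[i] is the value of document docIDs[i]. */
  static List<Interval> build(int[] docIDs, long[] values) {
    List<Interval> out = new ArrayList<>();
    for (int start = 0; start < docIDs.length; start += INTERVAL) {
      int end = Math.min(start + INTERVAL, docIDs.length);
      long min = Long.MAX_VALUE;
      long max = Long.MIN_VALUE;
      for (int i = start; i < end; i++) {
        min = Math.min(min, values[i]);
        max = Math.max(max, values[i]);
      }
      out.add(new Interval(docIDs[start], docIDs[end - 1], min, max, end - start));
    }
    return out;
  }

  public static void main(String[] args) {
    int n = 10_000;
    int[] docs = new int[n];
    long[] vals = new long[n];
    for (int i = 0; i < n; i++) {
      docs[i] = i;
      vals[i] = 1_000L + i; // values follow doc order, as with an index sorted on this field
    }
    build(docs, vals).forEach(System.out::println);
  }
}
```

When the index is sorted on the field, the value ranges of consecutive intervals barely overlap, which is why such a simple structure can already skip most of the data.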

In order to index a field with a skip list, we added static methods to the doc values fields, for example NumericDocValuesField#indexedField(String name, long value), which will generate the right FieldType. In order to query it, you can use the existing NumericDocValuesField#newSlowRangeQuery(String field, long lowerValue, long upperValue). The generated query will use the skip index, if it exists, through the DocValuesRangeIterator.
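The way a range query can take advantage of the skip index can be sketched as follows. This is not the DocValuesRangeIterator implementation: the names are hypothetical, the three-state `Match` enum is a simplification, and the sketch assumes a dense single-valued field where `values[doc]` holds the value of document `doc`.

```java
import java.util.List;

final class SkipAwareRangeSketch {
  /** Simplified outcome of comparing an interval with the query range. */
  enum Match { NO, MAYBE, YES }

  /** Per-interval statistics (hypothetical shape). */
  record Interval(int minDocID, int maxDocID, long minValue, long maxValue, int docCount) {}

  static Match classify(Interval iv, long lower, long upper) {
    if (iv.maxValue() < lower || iv.minValue() > upper) {
      return Match.NO; // no value of this interval can fall in the range
    }
    if (iv.minValue() >= lower && iv.maxValue() <= upper) {
      return Match.YES; // every value of this interval falls in the range
    }
    return Match.MAYBE; // partial overlap: the values must be checked one by one
  }

  /** Counts matching docs, reading doc values only for MAYBE intervals. */
  static int count(List<Interval> intervals, long[] values, long lower, long upper) {
    int hits = 0;
    for (Interval iv : intervals) {
      switch (classify(iv, lower, upper)) {
        case NO -> { }
        case YES -> hits += iv.docCount();
        case MAYBE -> {
          for (int doc = iv.minDocID(); doc <= iv.maxDocID(); doc++) {
            if (values[doc] >= lower && values[doc] <= upper) {
              hits++;
            }
          }
        }
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    long[] values = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
    List<Interval> intervals = List.of(
        new Interval(0, 3, 0, 3, 4),
        new Interval(4, 7, 4, 7, 4),
        new Interval(8, 11, 8, 11, 4));
    System.out.println(count(intervals, values, 2, 7));
  }
}
```

In this toy example only the first interval needs its values inspected; the second is accepted and the third rejected from the statistics alone.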

Finally, here are some numbers I got using the geonames data set from luceneutil.

The first test indexes the field called modified and uses it as the primary sort of the index.

 Index LongField query LongField#newRangeQuery

  INDEX TIME: 42.604 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 12.402745246887207 MB
  QUERY TIME: 1157.7 ms
  QUERY DOCS: 6243379080 documents

Index LongField query IndexSortSortedNumericDocValuesRangeQuery

  INDEX TIME: 42.562 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 12.402745246887207 MB
  QUERY TIME: 662.6 ms
  QUERY DOCS: 6243379080 documents

Index Doc values skipping query SortedNumericDocValuesField#newSlowRangeQuery

  INDEX TIME: 38.927 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 11.800291061401367 MB
  QUERY TIME: 1072.5 ms
  QUERY DOCS: 6243379080 documents

This basic implementation is already faster than querying using the BKD tree. The IndexSortSortedNumericDocValuesRangeQuery is faster still, as it contains many optimisations, but my expectation is that we can make this index as fast as that implementation, if not faster.

In the second test, we index two fields and sort the index using them: countryCode as the primary sort and modified as the secondary sort. Then we execute the range queries on the modified field:

Index KeywordField, LongField query LongField#newRangeQuery

  INDEX TIME: 50.486 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 24.378992080688477 MB
  QUERY TIME: 1273.0 ms
  QUERY DOCS: 6243379080 documents

Index KeywordField, LongField query SortedNumericDocValuesField#newSlowRangeQuery

  INDEX TIME: 50.486 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 24.378992080688477 MB
  QUERY TIME: 13392.6 ms
  QUERY DOCS: 6243379080 documents

Index Doc values skipping for both, query SortedNumericDocValuesField#newSlowRangeQuery

  INDEX TIME: 44.127 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 16.09447193145752 MB
  QUERY TIME: 2975.0 ms
  QUERY DOCS: 6243379080 documents

In this case the query is slower than the BKD tree but still much faster than the brute-force approach. The advantage of the new query is that it does not need to build the big bitset that we might need to build with the BKD tree.

relates #11432

jpountz

I just did a full pass on the changes again, and it looks good to me (disclaimer: I contributed to this branch). It would be good to have someone else take a look as well.

ChrisHegarty

This looks great. Just a few comments

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java

benwtrent

I have minor comments.

The bulk of the implementation makes sense to me as a first step.

One weird thing is how we handle deletes. I am wondering about behavior when:

All docs in a skipper are deleted: we skip the skipper entry because its maxDocID is deleted as well.
Some of the docs in a skipper are deleted: our "min/max" will no longer be accurate at that point. We will still iterate the doc values in the skipper, which is OK. This just comes with the territory with deletes.

Are my ideas consistent with how we handle deletes?

I am pretty amazed that such a big change is only about 2k loc, many of which are just interface updates being populated.

benwtrent · 2024-06-05T19:56:26Z

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java

+              docCount = input.readInt();
+              break;
+            } else {
+              input.skipBytes(24);


Ah, I was wondering why we didn't use vint for the values, but now I see we keep track of the block size. Could you make this a constant?

The block size is actually 28. We read the first 4 bytes to compute the maxDocID and we skip the rest if it is not competitive.

I am hesitant to add a constant at the moment as this might change if we introduce levels.
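The fixed-width scan discussed in this thread can be sketched with a plain `ByteBuffer`. Only the 28-byte entry size and the leading 4-byte maxDocID come from the discussion; the order of the remaining 24 bytes, and all class and method names, are assumptions made for this sketch.

```java
import java.nio.ByteBuffer;

final class SkipEntryScanSketch {
  static final int ENTRY_BYTES = 28;                         // entry size stated in the thread
  static final int REST_BYTES = ENTRY_BYTES - Integer.BYTES; // 24 bytes after maxDocID

  /** Appends one entry. The field order after maxDocID is an assumption. */
  static void write(ByteBuffer out, int maxDocID, int minDocID,
                    long maxValue, long minValue, int docCount) {
    out.putInt(maxDocID).putInt(minDocID).putLong(maxValue).putLong(minValue).putInt(docCount);
  }

  /** Returns {minDocID, maxDocID, docCount} of the first entry with maxDocID >= target, or null. */
  static int[] advance(ByteBuffer in, int target) {
    while (in.remaining() >= ENTRY_BYTES) {
      int maxDocID = in.getInt();
      if (maxDocID >= target) {
        int minDocID = in.getInt();
        in.getLong(); // maxValue, not needed by this sketch
        in.getLong(); // minValue, not needed by this sketch
        int docCount = in.getInt();
        return new int[] {minDocID, maxDocID, docCount};
      }
      // Entry is not competitive: skip its remaining 24 bytes.
      in.position(in.position() + REST_BYTES);
    }
    return null;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(3 * ENTRY_BYTES);
    write(buf, 4095, 0, 10L, 1L, 4096);
    write(buf, 8191, 4096, 20L, 11L, 4096);
    write(buf, 9999, 8192, 30L, 21L, 1808);
    buf.flip();
    int[] entry = advance(buf, 5000);
    System.out.println(entry[0] + ".." + entry[1] + " docCount=" + entry[2]);
  }
}
```

Fixed-width entries are what make the `skipBytes(24)` call possible; with variable-length integers each entry would have to be decoded to find the next one.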

benwtrent · 2024-06-05T20:06:47Z

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java

+    int docCount = meta.readInt();
+    int maxDocID = meta.readInt();
+
+    return new DocValuesSkipperEntry(offset, length, minValue, maxValue, docCount, maxDocID);


Could we validate the skipper entry? My understanding is that length == 24*docCount right?

The validation would look like:

assert length == 28 * (1 + ((docCount - 1) / SKIP_INDEX_INTERVAL_SIZE));

As before I am hesitant to add the validation at the moment as this might change if we introduce levels.
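For a sense of scale, the formula above can be evaluated directly. This is only a worked example of the arithmetic: `SKIP_INDEX_INTERVAL_SIZE` is taken to be the 4096 documents mentioned in the description, the class and method names are hypothetical, and the last call assumes that every document of the geonames benchmark has a value for the field.

```java
final class SkipIndexLengthSketch {
  static final int ENTRY_BYTES = 28;                // entry size stated in the thread
  static final int SKIP_INDEX_INTERVAL_SIZE = 4096; // interval size from the description

  /** Expected skip index length in bytes for docCount documents with a value. */
  static long expectedLength(int docCount) {
    if (docCount <= 0) {
      // The formula relies on integer division and is only meaningful for docCount >= 1.
      throw new IllegalArgumentException("docCount must be positive");
    }
    return (long) ENTRY_BYTES * (1 + (docCount - 1) / SKIP_INDEX_INTERVAL_SIZE);
  }

  public static void main(String[] args) {
    System.out.println(expectedLength(4096));       // one interval: 28 bytes
    System.out.println(expectedLength(4097));       // two intervals: 56 bytes
    System.out.println(expectedLength(12_347_608)); // 3015 intervals: 84420 bytes
  }
}
```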

easyice · 2024-06-06T08:28:07Z

lucene/core/src/java/org/apache/lucene/index/FieldInfo.java

+   *
+   * @throws IllegalArgumentException if they are not the same
+   */
+  static void verifySameDocValuesSkipIndex(


Do we need to do the same check in FieldInfos#verifySameSchema?

I think you are right, but I wonder if we should first optimize FieldInfo.FieldNumbers. I opened #13460 for consideration.

Added support in 7381364

easyice · 2024-06-06T08:34:49Z

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java

-        false);
+        };
+    if (field.hasDocValuesSkipIndex()) {
+      writeSkipIndex(field, producer);


Do we need to check the skip index in CheckIndex?

Added initial support in e782fef

# Conflicts: # lucene/core/src/java/org/apache/lucene/document/SortedNumericDocValuesRangeQuery.java # lucene/core/src/java/org/apache/lucene/document/SortedSetDocValuesRangeQuery.java

benwtrent

This is an exciting first step!

🚀 🚀 🚀

easyice

Great work!

iverase · 2024-06-12T08:34:05Z

If there are no more comments, I will be merging this PR tomorrow.

…13487) The introduction of the doc values skip index in #13449 broke the backward codec tests, as those codecs do not support it. This commit fixes that by breaking up the base class for the tests.

jpountz · 2024-06-18T07:18:16Z

FWIW I'm trying to use #11432 as a meta issue for sparse indexing and started listing tasks that I think we (ideally) need to complete to be in a good state for 9.0.

…pache#13449) Optional skip list on top of doc values which is exposed via the DocValuesSkipper abstraction. A new flag is added to FieldType.java that configures whether to create a "skip index" for doc values. Co-authored-by: Adrien Grand <jpountz@gmail.com>

- GITHUB#13449: Sparse index, optional skip list on top of doc values (apache/lucene#13449) - Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index (apache/lucene#13550) - Add levels to DocValues skipper index (apache/lucene#13563) - Align doc value skipper interval boundaries when an interval contains a constant value (apache/lucene#13597)

iverase and others added 26 commits May 29, 2024 09:45

Initial draft for sparse index on top of doc values

7d1dce8

iter

8acc67e

Improve javadocs.

970bfe8

FieldInfos

7156ceb

iter

0891f25

Add checks that we don't advance doc values with Match.YES.

271fb08

Fix SimpleText failure.

6708620

Fix test failure with field infos due to invalid combination.

da83067

Tidy.

57870c7

Remove useless leniency.

aa42209

Improve document API.

1bac5e4

Rename absminvalue -> minvalue and minvalue -> origin.

eb8152d

Remove dead code.

f36adc3

minor cleanup

f053bdb

Fix DocValueType compatibility check.

fedfff4

tidy

2ecb138

Test taking advantage of multiple levels in DocValuesRangeIterator.

542e824

Fix/simplify level testing.

202da01

Fix bug when all values are merged away.

a0d0212

Fix tests with SimpleTextDocValuesFormat.

beaf4c2

Add checks for docCount on empty skippers.

9bf8de4

make test small medium and big

560ee85

remove unreachable condition

88e45b3

Indexing chain should prevent mixed setup

2e88863

Merge remote-tracking branch 'upstream/main' into sparse_index

f4c5b02

iter

7ce5c2d

jpountz added this to the 10.0.0 milestone Jun 4, 2024

Replace while loop with do..while loop.

9a8ba61

jpountz approved these changes Jun 4, 2024

ChrisHegarty reviewed Jun 4, 2024

gf2121 approved these changes Jun 5, 2024

iverase and others added 3 commits June 5, 2024 13:49

review suggestions

d83c97d

Merge branch 'main' into sparse_index

d4918eb

Add prefetching for skip indexes

0a6153a

benwtrent reviewed Jun 5, 2024

optimize for matchAll

6f0259c

easyice reviewed Jun 6, 2024

iverase added 4 commits June 6, 2024 10:47

revert optimize for matchAll

6071604

Merge branch 'main' into sparse_index

f61d895

verify schema in FieldInfos

7381364

Add checkIndex support

e782fef

benwtrent approved these changes Jun 7, 2024

easyice approved these changes Jun 9, 2024

iverase added 2 commits June 10, 2024 12:19

Merge remote-tracking branch 'upstream/main' into sparse_index

4bf7d64

Add entry in CHANGES.txt

ac09c23

Merge branch 'main' into sparse_index

b45ec2b

iverase merged commit 0487702 into apache:main Jun 13, 2024
3 checks passed

iverase deleted the sparse_index branch June 13, 2024 08:17

iverase mentioned this pull request Jun 14, 2024

Fix backward codec test after introducing the doc values skip index #13487

Merged

jpountz mentioned this pull request Jun 18, 2024

Add support for sparse indexing [LUCENE-10396] #11432

Closed

5 tasks

iverase mentioned this pull request Jul 8, 2024

Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index #13550

Merged

animodak7 mentioned this pull request Mar 27, 2025

[Feature Request] Use lucene sparse index in opensearch opensearch-project/OpenSearch#17710

Open

gerlowskija mentioned this pull request Apr 18, 2025

SOLR-17631: Upgrade main to Lucene 10.x apache/solr#3053

Open

kkewwei mentioned this pull request Apr 24, 2025

Add SkipIndex in SortedNumericDocValuesSetQuery #14551

Open
