Skip to content

Sparse index: optional skip list on top of doc values #13449

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Jun 13, 2024

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Jun 4, 2024

Speaking to Adrien about how a sparse index would look like in lucene, he suggested that the sparse indexing does not need to be a new format bit an additional responsibility if DocValuesFormat.

The idea is to add an option to add a skip list on top of doc values and to expose it via the DocValuesSkipper abstraction, which has an API that is similar to what we're doing for impacts today. This provides a very simple index which can be very efficient when the index is sorted and the field belongs to the index sorting.

In order to implement it, we added a new flag in FieldType.java that configures whether to create a "skip index" for doc values. This flag is only allowed to be set on doc values of type NUMERIC, SORTED_NUMERIC, SORTED and SORTED_SET. Attempting to index other type of doc values with the flag set results on an exception.

This flag needs to be persisted on the FieldInfosFormat. This does not require a format change as we have some unused bit flags in Lucene94FieldInfosFormat that we can use.

We have changed the DocValuesFormat to generate the "skip index" whenever the flag is set. For this first implementation we went to the most basic implementation which consist in a skip list with just one level. In this level we collect the documents statistics every 4096 documents and we write them into the index. This basic structure already provides interesting numbers. I discussed with Adrien that as a follow up we should introduce more levels to the skip list and optimise the index for low cardinality fields.

In order to index a field with a skip list, we added static methods to the doc values field, for example NumericDocValuesField#indexedField(String name, long value) which will generated the right FieldType. In order to query it, you can use the existing NumericDocValuesField#newSlowRangeQuery(String field, long lowerValue, long upperValue). The generated query will use the skip index if exists by using the DocValuesRangeIterator.

Finally, here are some number I got using the geonames data set from lucene util.

The first test index the field called modified and adds the field as the primary sort of the index.

 Index LongField query LongField#newRangeQuery

  INDEX TIME: 42.604 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 12.402745246887207 MB
  QUERY TIME: 1157.7 ms
  QUERY DOCS: 6243379080 documents

Index LongField query IndexSortSortedNumericDocValuesRangeQuery

  INDEX TIME: 42.562 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 12.402745246887207 MB
  QUERY TIME: 662.6 ms
  QUERY DOCS: 6243379080 documents

Index Doc values skipping query SortedNumericDocValuesField#newSlowRangeQuery

  INDEX TIME: 38.927 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 11.800291061401367 MB
  QUERY TIME: 1072.5 ms
  QUERY DOCS: 6243379080 documents

This basic implementation is already faster that querying using the bkd tree. The IndexSortSortedNumericDocValuesRangeQuery is faster as it contains many optimisations but my expectation is that we can make this index as fast if not faster than this implementation.

The second test, we are indexing two fields and sorting the index using them; the countryCode as primary sort and the modified field as secondary sort. Then we execute the range queries on the modified field:

Index KeywordField, LongField query LongField#newRangeQuery

  INDEX TIME: 50.486 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 24.378992080688477 MB
  QUERY TIME: 1273.0 ms
  QUERY DOCS: 6243379080 documents

Index KeywordField, LongField query SortedNumericDocValuesField#newSlowRangeQuery

  INDEX TIME: 50.486 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 24.378992080688477 MB
  QUERY TIME: 13392.6 ms
  QUERY DOCS: 6243379080 documents

 Index Doc values skipping for both, query  SortedNumericDocValuesField#newSlowRangeQuery
  INDEX TIME: 44.127 sec
  INDEX DOCS: 12347608 documents
  INDEX SIZE: 16.09447193145752 MB
  QUERY TIME: 2975.0 ms
  QUERY DOCS: 6243379080 documents

In this case the query is slower than the BKD tree but still much faster than the brute approach. The advantage of the new query is that it does not need to build the big bitset that we might need to build with the BKD tree.

relates #11432

@jpountz jpountz added this to the 10.0.0 milestone Jun 4, 2024
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a full pass on changes again, and it looks good to me (disclaimed: I contributed to this branch). Would be good to have someone else take a look as well.

Copy link
Contributor

@ChrisHegarty ChrisHegarty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Just a few comments

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have minor comments.

The bulk of the implementation makes sense to me as a first step.

One weird thing is how we handle deletes. I am wondering about behavior when:

  • All docs in a skipper are deleted. We skip the skipper-entry due to its maxDocID actually being deleted as well.
  • Some of the docs deleted in a skipper. Our "min/max" at that point will not be accurate. We will still iterate the doc values in the skipper, which is OK. This just comes with the territory with deletes.

Are my ideas consistent with how we handle deletes?

I am pretty amazed that such a big change is only about 2k loc, many of which are just interface updates being populated.

docCount = input.readInt();
break;
} else {
input.skipBytes(24);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was wondering why we didn't use vint for the values, but now I see we keep track of the block size. Could you make this a constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block size is actually 28. We read the first 4 bytes to compute the maxDocID and we skip the rest if it is not competitive.

I am hesitant to add a constant at the moment as this might change if we introduce levels.

int docCount = meta.readInt();
int maxDocID = meta.readInt();

return new DocValuesSkipperEntry(offset, length, minValue, maxValue, docCount, maxDocID);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we validate the skipper entry? My understanding is that length == 24*docCount right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation would look like:

 assert length == 28 * (1 + ((docCount - 1) / SKIP_INDEX_INTERVAL_SIZE));

As before I am hesitant to add the validation at the moment as this might change if we introduce levels.

*
* @throws IllegalArgumentException if they are not the same
*/
static void verifySameDocValuesSkipIndex(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do the same check in FieldInfos#verifySameSchema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right but I wonder if we should optimize first FieldInfo.FieldNumbers. I open #13460 for consideration,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added support on 7381364

false);
};
if (field.hasDocValuesSkipIndex()) {
writeSkipIndex(field, producer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check the skip index in CheckIndex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added inital aupport on e782fef

iverase added 4 commits June 6, 2024 10:47
# Conflicts:
#	lucene/core/src/java/org/apache/lucene/document/SortedNumericDocValuesRangeQuery.java
#	lucene/core/src/java/org/apache/lucene/document/SortedSetDocValuesRangeQuery.java
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an exciting first step!

🚀 🚀 🚀

Copy link
Contributor

@easyice easyice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

@iverase
Copy link
Contributor Author

iverase commented Jun 12, 2024

If there are no more comments, I will be merging this PR tomorrow.

@iverase iverase merged commit 0487702 into apache:main Jun 13, 2024
3 checks passed
@iverase iverase deleted the sparse_index branch June 13, 2024 08:17
iverase added a commit that referenced this pull request Jun 17, 2024
…13487)

The introduction of the doc values skip index in #13449 broke the backward codec test as those codecs do not support
 it. This commit fix it by breaking up the base class for the tests.
@jpountz
Copy link
Contributor

jpountz commented Jun 18, 2024

FWIW I'm trying to use #11432 as a meta issue for sparse indexing and started listing tasks that I think we (ideally) need to complete to be in a good state for 9.0.

linfn pushed a commit to linfn/lucene that referenced this pull request Nov 5, 2024
…pache#13449)

Optional skip list on top of doc values which is exposed via the DocValuesSkipper abstraction. A new flag is
added to FieldType.java that configures whether to create a "skip index" for doc values.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
romseygeek added a commit to crate/crate that referenced this pull request Mar 5, 2025
- GITHUB#13449:  Sparse index, optional skip list on top of doc values (apache/lucene#13449)
- Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index (apache/lucene#13550)
- Add levels to DocValues skipper index (apache/lucene#13563)
- Align doc value skipper interval boundaries when an interval contains a constant value (apache/lucene#13597)
mergify bot pushed a commit to crate/crate that referenced this pull request Mar 5, 2025
- GITHUB#13449:  Sparse index, optional skip list on top of doc values (apache/lucene#13449)
- Introduce TestLucene90DocValuesFormatVariableSkipInterval for testing docvalues skipper index (apache/lucene#13550)
- Add levels to DocValues skipper index (apache/lucene#13563)
- Align doc value skipper interval boundaries when an interval contains a constant value (apache/lucene#13597)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants