Skip to content

Use Lucene101 postings format by default #126080

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 4, 2025

Conversation

jordan-powers
Copy link
Contributor

With this change, new standard indices will use the Lucene101PostingsFormat instead of the current default ES812PostingsFormat.

The current implementation enables the new format for all standard indices. We might consider instead only enabling the new format only for non-datastream standard indices, and/or only when index.codec is not set to best_compression.

Currently, use of the new codec is gated behind a feature flag.

We may or may not decide to backport this to 8.x depending on how the nightly benchmarks look.

Closes #119051

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment, LGTM otherwise.

assertThat(perFieldMapperCodec.getPostingsFormatForField("another_field"), instanceOf(ES812PostingsFormat.class));
assertThat(
perFieldMapperCodec.getPostingsFormatForField("another_field"),
instanceOf(timeSeries ? ES812PostingsFormat.class : Lucene101PostingsFormat.class)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests fail if build is in release mode. Can you add test-release label to this PR?
I think we should have an if check here. If feature flag is enabled we expect the lucene 101 posting format otherwise the es812 format.

@javanna
Copy link
Member

javanna commented Apr 2, 2025

Do we need to also test full bwc on this PR just to make sure? Backwards compatibility is handled via the new index version and applying the new codec on write only for newly created indices, correct? Do we need any additional testing compared to what we already have?

@martijnvg
Copy link
Member

Do we need to also test full bwc on this PR just to make sure?

I think we can also add the test-full-bwc label here.

Backwards compatibility is handled via the new index version and applying the new codec on write only for newly created indices, correct?

This isn't gated behind an index version (for now just feature flag). Any non tsdb/logsdb index (including existing indices) will start to use the new codec for new segments. The old segments remain to be used by the previous posting format. The codec name is stored in disk and that is used to load codec via service provider.

Do we need any additional testing compared to what we already have?

I think current mixed cluster / rolling upgrade integration testing is sufficient? We don't modify an existing posting format, just use a different one.

@jordan-powers
Copy link
Contributor Author

jordan-powers commented Apr 2, 2025

This isn't gated behind an index version (for now just feature flag). Any non tsdb/logsdb index (including existing indices) will start to use the new codec for new segments.

Actually, I did add a new index version and I check the index version when deciding which postings format to use. I wasn't sure about backwards compatibility--would lucene still read the old indices correctly if it's expecting the new postings format? If so, I can remove the check

@jordan-powers jordan-powers added test-full-bwc Trigger full BWC version matrix tests test-release Trigger CI checks against release build labels Apr 2, 2025
@martijnvg
Copy link
Member

Actually, I did add a new index version and I check the index version when deciding which postings format to use. I wasn't sure about backwards compatibility--would lucene still read the old indices correctly if it's expecting the new postings format? If so, I can remove the check

Whoops, you're right. I think we can remove the index version check. I don't gating behind index version is needed, given that changing codec on per field level should work (as long as the codec remains available in codebase).

@javanna
Copy link
Member

javanna commented Apr 2, 2025

Right, getPostingsFormatPerField is called on write. What has been already written has the old codec name on disk, and the appropriate formats will be loaded via SPI, they just need to be available in the classpath and registered to SPI, which is the case. This does mean that existing indices will get new segments written with the new format, if that's what you folks want to achieve here, as opposed to starting to use the new format only for new indices. I agree that if there is a feature flag, the index version is probably redundant and we should retain compatibility enabling and disabling the feature flag. Do we test that? Probably not, but I guess it does not matter as we won't enable the feature flag in prod?

@jordan-powers
Copy link
Contributor Author

This does mean that existing indices will get new segments written with the new format

I think that's the goal.

we should retain compatibility enabling and disabling the feature flag

Could you clarify what you mean by this? The feature flag is just there because we're not sure we want to push this to prod yet. Once we've benchmarked it and we're ready to make it generally available, I expect we'll remove the feature flag so that it's always enabled. It's not intended to be toggled on/off depending on compatibility.

My understanding is that compatibility is guaranteed because the formatted postings include the name of the format class, so when they're loaded, the service provider knows which class to use to decode them. Sure, if we need to write new postings to disk, they'll always be in the new format so indices might end up with a mix of different posting formats, but that's fine because each written postings file includes the codec name.

we won't enable the feature flag in prod?

Right, at this point we just want to see the effect on the nightly benchmarks

@martijnvg
Copy link
Member

I chatted with Luca and we both agree that when the feature flag gets removed we should introduce an index version, so that only new indices will use 101 postings codec. This makes reasoning about which postings format an index uses easier.

@jordan-powers
Copy link
Contributor Author

Ok, I'll reintroduce the index version check then.

@jordan-powers
Copy link
Contributor Author

Release tests are failing due to #124263. Waiting for that fix to be merged before I merge this PR.

@jordan-powers
Copy link
Contributor Author

Actually seems that #124263 won't fix the problem. Issue is being tracked in #126086. Will wait for resolution before merging this PR.

@jordan-powers jordan-powers merged commit 4c174a8 into elastic:main Apr 4, 2025
19 checks passed
@jordan-powers jordan-powers deleted the use-lucene-postings-format branch April 4, 2025 19:41
jordan-powers added a commit to jordan-powers/elasticsearch that referenced this pull request Apr 16, 2025
@@ -31,19 +34,33 @@
* vectors.
*/
public class PerFieldFormatSupplier {
public static final FeatureFlag USE_LUCENE101_POSTINGS_FORMAT = new FeatureFlag("use_lucene101_postings_format");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jordan-powers @martijnvg Lucene 10.3 (still in development) will bring a new postings format. Is the intention that this feature flag to select the "latest" codec? Or is it really specific to 10.1? I ask only because we need to determine if a use_lucene101_postings_format is warranted or maybe make this flag more general?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is the use the latest default lucene postings format. Unfortunately the name doesn't reflect that, but we want it to use the latest default postings format.

martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request May 21, 2025
Change PerFieldFormatSupplier to get default postings from current codec being used. This way the default codec for standard index mode isn't tied to Lucene101PostingsFormat specifically.

And rename the feature flag that controls when stock Lucene posting format is used.

Addressed comment from elastic#126080
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request May 21, 2025
martijnvg added a commit that referenced this pull request May 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>non-issue :StorageEngine/Codec Team:StorageEngine test-full-bwc Trigger full BWC version matrix tests test-release Trigger CI checks against release build v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use 'Lucene912PostingsFormat' when storage efficiency isn't critical
5 participants