Ensure nested documents have consistent version and seq_ids #27455
Today we index dummy values for seq_ids and version on nested documents. This is trappy, since users can request these values via inner hits, and it is not necessarily good for compression, since the dummy value will likely not compress well when seqIDs are lowish. This change ensures that all documents in a nested block share the same field values. This adds no overhead; in fact it might be slightly more efficient, since it reduces the work needed.
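To make the intent concrete, here is a minimal sketch of "share the same field values for all documents in a nested block". It is not the Elasticsearch implementation: the class `NestedBlockSketch`, its methods, and the modelling of a document as a plain map of numeric values are all hypothetical, chosen only so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a nested block: each document is a map
// from field name to a numeric doc value. This is not Elasticsearch code.
final class NestedBlockSketch {
    static final String SEQ_NO = "_seq_no";
    static final String VERSION = "_version";

    /** Builds a block of {@code size} empty documents (root plus nested children). */
    static List<Map<String, Long>> newBlock(int size) {
        List<Map<String, Long>> block = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            block.add(new HashMap<>());
        }
        return block;
    }

    /** Stamps every document in the block with the same seq no and version. */
    static void shareRootValues(List<Map<String, Long>> block, long seqNo, long version) {
        for (Map<String, Long> doc : block) {
            // No dummy value for nested children: every document in the
            // block carries the values of the operation that indexed it.
            doc.put(SEQ_NO, seqNo);
            doc.put(VERSION, version);
        }
    }

    public static void main(String[] args) {
        List<Map<String, Long>> block = newBlock(3);
        shareRootValues(block, 42L, 7L);
        System.out.println(block);
    }
}
```

With values shared this way, an inner hit on a nested document would report the same seq no and version as its root document rather than a placeholder.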
LGTM. We copied this pattern from the version field:
    for (int i = 1; i < context.docs().size(); i++) {
        final Document doc = context.docs().get(i);
        doc.add(new NumericDocValuesField(NAME, 1L));
    }
Since the common version value is 1, I guess this works well for compression in the _version case but doesn't work well for seq nos. I presume maintaining the semi-linear increase pattern of seq# is indeed a good thing for compression, but only @jpountz knows what black magic is done there.
LGTM
LGTM.
For the record, this did not matter before Lucene 7.0: the random-access API made it hard to compress things efficiently, so we encoded entire segments with the same number of bits per value. This changes as of 7.0, where we now perform block-based compression when it saves space.
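A rough way to see why block-based compression makes the dummy values matter: assume, purely for illustration, that each block stores its minimum once and then encodes every value as an offset from that minimum using a fixed number of bits. This is a simplified model and not Lucene's actual codec; `BlockBitsSketch`, `bitsPerValue`, and the dummy value `0` are hypothetical.

```java
// Simplified model of per-block numeric encoding (an assumption, not
// Lucene's real format): store the block minimum, then every value as
// (value - min) using the same number of bits.
final class BlockBitsSketch {

    /** Bits per value needed to store offsets from the block minimum. */
    static int bitsPerValue(long[] block) {
        if (block.length == 0) {
            throw new IllegalArgumentException("block must not be empty");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : block) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        // Sketch assumes non-negative values, so max - min does not overflow.
        long range = max - min;
        return 64 - Long.numberOfLeadingZeros(range);
    }

    public static void main(String[] args) {
        // Nested children share the seq no of their root document.
        long[] shared = {1_000_000, 1_000_000, 1_000_000, 1_000_001, 1_000_001, 1_000_002};
        // Nested children carry a hypothetical dummy value of 0 instead.
        long[] withDummy = {0, 0, 1_000_000, 0, 1_000_001, 1_000_002};
        System.out.println("shared:     " + bitsPerValue(shared) + " bits per value");
        System.out.println("with dummy: " + bitsPerValue(withDummy) + " bits per value");
    }
}
```

Under this model the block with shared values spans a tiny range and needs only a couple of bits per value, while interleaved dummy values stretch the range to the full magnitude of the sequence numbers.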
* master: (31 commits)
  [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure
  Transition transport apis to use void listeners (#27440)
  AwaitsFix GeoShapeQueryTests#testPointsOnly #27454
  Bump test version after backport
  Ensure nested documents have consistent version and seq_ids (#27455)
  Tests: Add Fedora-27 to packaging tests
  Delete some seemingly unused exceptions (#27439)
  #26800: Fix docs rendering
  Remove config prompting for secrets and text (#27216)
  Move the CLI into its own subproject (#27114)
  Correct usage of "an" to "a" in getting started docs
  Avoid NPE when getting build information
  Removes BWC snapshot status handler used in 6.x (#27443)
  Remove manual tracking of registered channels (#27445)
  Remove parameters on HandshakeResponseHandler (#27444)
  [GEO] fix pointsOnly bug for MULTIPOINT
  Standardize underscore requirements in parameters (#27414)
  peanut butter hamburgers
  Log primary-replica resync failures
  Uses TransportMasterNodeAction to update shard snapshot status (#27165)
  ...
* 6.x: (41 commits)
  [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure
  Transition transport apis to use void listeners (#27440)
  AwaitsFix GeoShapeQueryTests#testPointsOnly #27454
  Ensure nested documents have consistent version and seq_ids (#27455)
  Tests: Add Fedora-27 to packaging tests
  #26800: Fix docs rendering
  Move the CLI into its own subproject (#27114)
  Correct usage of "an" to "a" in getting started docs
  Avoid NPE when getting build information
  Remove manual tracking of registered channels (#27445)
  Standardize underscore requirements in parameters (#27414)
  Remove parameters on HandshakeResponseHandler (#27444)
  [GEO] fix pointsOnly bug for MULTIPOINT
  peanut butter hamburgers
  Uses TransportMasterNodeAction to update shard snapshot status (#27165)
  Log primary-replica resync failures
  Add limits for ngram and shingle settings (#27411)
  Enforce a minimum task execution and service time of 1 nanosecond
  Fix place-holder in allocation decider messages (#27436)
  Remove newline from log message (#27425)
  ...