Skip to content

Ensure nested documents have consistent version and seq_ids #27455

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 20, 2017

Conversation

s1monw
Copy link
Contributor

@s1monw s1monw commented Nov 20, 2017

Today we index dummy values for seq_ids and version on nested documents.
This is on the one hand trappy since users can request these values via
inner hits and on the other hand not necessarily good for compression since
the dummy value will likely not compress well when seqIDs are lowish.

This change ensures that we share the same field values for all documents in a
nested block. This won't have any overhead, in-fact it might be more efficient since
we even reduce the work needed slightly.

Today we index dummy values for seq_ids and version on nested documents.
This is on the one hand trappy since users can request these values via
inner hits and on the other hand not necessarily good for compression since
the dummy value will likely not compress well when seqIDs are lowish.

This change ensures that we share the same field values for all documents in a
nested block. This won't have any overhead, in-fact it might be more efficient since
we even reduce the work needed slightly.
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We copied this pattern from the version field:

        for (int i = 1; i < context.docs().size(); i++) {
            final Document doc = context.docs().get(i);
            doc.add(new NumericDocValuesField(NAME, 1L));
        }

Since the common version value is 1, I guess this works good for compression in the _version case but doesn't work well for seq nos. I presume maintaining the semi linear increase pattern of seq# is indeed a good thing for compression but only @jpountz knows what black magic is done there.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@s1monw s1monw merged commit 720e96e into elastic:master Nov 20, 2017
s1monw added a commit that referenced this pull request Nov 20, 2017
Today we index dummy values for seq_ids and version on nested documents.
This is on the one hand trappy since users can request these values via
inner hits and on the other hand not necessarily good for compression since
the dummy value will likely not compress well when seqIDs are lowish.

This change ensures that we share the same field values for all documents in a
nested block. This won't have any overhead, in-fact it might be more efficient since
we even reduce the work needed slightly.
s1monw added a commit that referenced this pull request Nov 20, 2017
Today we index dummy values for seq_ids and version on nested documents.
This is on the one hand trappy since users can request these values via
inner hits and on the other hand not necessarily good for compression since
the dummy value will likely not compress well when seqIDs are lowish.

This change ensures that we share the same field values for all documents in a
nested block. This won't have any overhead, in-fact it might be more efficient since
we even reduce the work needed slightly.
s1monw added a commit that referenced this pull request Nov 20, 2017
@s1monw s1monw removed the :Internal label Nov 20, 2017
@jpountz
Copy link
Contributor

jpountz commented Nov 20, 2017

I presume maintaining the semi linear increase pattern of seq# is indeed a good thing for compression but only @jpountz knows what black magic is done there.

For the record, this did not matter before Lucene 7.0 since the random-access API made it hard to compress things efficiently, so we encoded entire segments with the same number of bits per value. This is changing as of 7.0 and we now perform block-based compression if it proves to save space.

jasontedor added a commit that referenced this pull request Nov 20, 2017
* master: (31 commits)
  [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure
  Transition transport apis to use void listeners (#27440)
  AwaitsFix GeoShapeQueryTests#testPointsOnly #27454
  Bump test version after backport
  Ensure nested documents have consistent version and seq_ids (#27455)
  Tests: Add Fedora-27 to packaging tests
  Delete some seemingly unused exceptions (#27439)
  #26800: Fix docs rendering
  Remove config prompting for secrets and text (#27216)
  Move the CLI into its own subproject (#27114)
  Correct usage of "an" to "a" in getting started docs
  Avoid NPE when getting build information
  Removes BWC snapshot status handler used in 6.x (#27443)
  Remove manual tracking of registered channels (#27445)
  Remove parameters on HandshakeResponseHandler (#27444)
  [GEO] fix pointsOnly bug for MULTIPOINT
  Standardize underscore requirements in parameters (#27414)
  peanut butter hamburgers
  Log primary-replica resync failures
  Uses TransportMasterNodeAction to update shard snapshot status (#27165)
  ...
jasontedor added a commit that referenced this pull request Nov 20, 2017
* 6.x: (41 commits)
  [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure
  Transition transport apis to use void listeners (#27440)
  AwaitsFix GeoShapeQueryTests#testPointsOnly #27454
  Ensure nested documents have consistent version and seq_ids (#27455)
  Tests: Add Fedora-27 to packaging tests
  #26800: Fix docs rendering
  Move the CLI into its own subproject (#27114)
  Correct usage of "an" to "a" in getting started docs
  Avoid NPE when getting build information
  Remove manual tracking of registered channels (#27445)
  Standardize underscore requirements in parameters (#27414)
  Remove parameters on HandshakeResponseHandler (#27444)
  [GEO] fix pointsOnly bug for MULTIPOINT
  peanut butter hamburgers
  Uses TransportMasterNodeAction to update shard snapshot status (#27165)
  Log primary-replica resync failures
  Add limits for ngram and shingle settings (#27411)
  Enforce a minimum task execution and service time of 1 nanosecond
  Fix place-holder in allocation decider messages (#27436)
  Remove newline from log message (#27425)
  ...
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Nested Docs labels Feb 14, 2018
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v6.0.1 v6.1.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants