Skip to content

Save a little space in agg tree #53730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 25, 2020

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Mar 18, 2020

This drop the "top level" pipeline aggregators from the aggregation
result tree which should save a little memory and a few serialization
bytes. Perhaps more imporantly, this provides a mechanism by which we
can remove all pipelines from the aggregation result tree. This will
save quite a bit of space when pipelines are deep in the tree.

Sadly, doing this isn't simple because of backwards compatibility. Nodes
before 7.8.0 need those pipelines. We provide them by setting passing
a Supplier<PipelineTree> into the root of the aggregation tree that we
only call if we need to serialize to a version before 7.8.0.

This solution works for cross cluster search because we always reduce
the aggregations in each remote cluster and then forward them back to
the coordinating node. Its quite possible that the coordinating node
needs the pipeline (say it is version 7.1.0) and the gateway node in the
remote cluster doesn't (version 7.8.0). In that case the data nodes
won't send the pipeline aggregations back to the gateway node.
Critically, the gateway node will send the pipeline aggregations back
to the coordinating node. This is all managed with that
Supplier<PipelineTree>, but how it is managed is a bit tricky.

This drop the "top level" pipeline aggregators from the aggregation
result tree which should save a little memory and a few serialization
bytes. Perhaps more imporantly, this provides a mechanism by which we
can remove *all* pipelines from the aggregation result tree. This will
save quite a bit of space when pipelines are deep in the tree.

Sadly, doing this isn't simple because of backwards compatibility. Nodes
before 7.7.0 *need* those pipelines. We provide them by setting passing
a `Supplier<PipelineTree>` into the root of the aggregation tree that we
only call if we need to serialize to a version before 7.7.0.

This solution works for cross cluster search because we always reduce
the aggregations in each remote cluster and then forward them back to
the coordinating node. Its quite possible that the coordinating node
needs the pipeline (say it is version 7.1.0) and the gateway node in the
remote cluster doesn't (version 7.7.0). In that case the data nodes
won't send the pipeline aggregations back to the gateway node.
Critically, the gateway node *will* send the pipeline aggregations back
to the coordinating node. This is all managed with that
`Supplier<PipelineTree>`, but *how* it is managed is a bit tricky.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments mainly around testing. Think this looks good! Serialization changes always give me the cold sweats but I can't see any flaws currently with the approach.

@@ -115,6 +115,39 @@
- match: { aggregations.cluster.buckets.0.key: "local_cluster" }
- match: { aggregations.cluster.buckets.0.doc_count: 5 }

# once more, this time with a pipeline agg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, should we maybe split this out into it's own test? Just thinking that long multi-step yaml tests can be tricky to debug sometimes.

Haven't looked at the rest of the tests in this yaml though, so it might not be easy to move the indexing (or whatever else) steps up to the setup though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move it, sure! They can get hard to debug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - this is a test that doesn't clear indices after it runs. So moving things around is a bit more complex than we'd like to be honest. I can do it, but I think it should wait for a followup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, no worries then. not worth re-arranging everything for 👍

* Setting the pipeline tree source to null is here is correct but
* only because we don't immediately pass the InternalAggregations
* off to another node. Instead, we always reduce together with
* many aggregations and that always adds the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the comment trails off without finishing it's thought :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I do that somet

@@ -115,6 +115,39 @@
- match: { aggregations.cluster.buckets.0.key: "local_cluster" }
- match: { aggregations.cluster.buckets.0.doc_count: 5 }

# once more, this time with a pipeline agg
- do:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a similar test for the rolling-upgrade module? Theoretically it should be the same as CCS, but it might also smoke out different issues due to heterogeneous serialization inside the same cluster (instead of funneling through a gateway).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be great to have a "mixed cluster CCS" test. I talked that one through with @javanna and we don't have one now and probably don't want to build one just for this.

terms:
field: f1.keyword
aggs:
s:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a non-top-level pipeline agg just to confirm they aren't affected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured I'd get it in my next PR about non-top-level pipeline aggs, but I'm happy to do it now!

@nik9000
Copy link
Member Author

nik9000 commented Mar 24, 2020

@polyfractal, I think this is ready for another round!

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nik9000
Copy link
Member Author

nik9000 commented Mar 24, 2020

Thanks @polyfractal !

@bpintea bpintea added v7.8.0 and removed v7.7.0 labels Mar 25, 2020
@nik9000 nik9000 merged commit e8c54c7 into elastic:master Mar 25, 2020
@nik9000
Copy link
Member Author

nik9000 commented Mar 25, 2020

I've updated all of the version numbers in the description to 7.8.0 because this is not making the 7.7.0 release train.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 25, 2020
This drop the "top level" pipeline aggregators from the aggregation
result tree which should save a little memory and a few serialization
bytes. Perhaps more imporantly, this provides a mechanism by which we
can remove *all* pipelines from the aggregation result tree. This will
save quite a bit of space when pipelines are deep in the tree.

Sadly, doing this isn't simple because of backwards compatibility. Nodes
before 7.7.0 *need* those pipelines. We provide them by setting passing
a `Supplier<PipelineTree>` into the root of the aggregation tree that we
only call if we need to serialize to a version before 7.7.0.

This solution works for cross cluster search because we always reduce
the aggregations in each remote cluster and then forward them back to
the coordinating node. Its quite possible that the coordinating node
needs the pipeline (say it is version 7.1.0) and the gateway node in the
remote cluster doesn't (version 7.7.0). In that case the data nodes
won't send the pipeline aggregations back to the gateway node.
Critically, the gateway node *will* send the pipeline aggregations back
to the coordinating node. This is all managed with that
`Supplier<PipelineTree>`, but *how* it is managed is a bit tricky.
nik9000 added a commit that referenced this pull request Mar 25, 2020
This drop the "top level" pipeline aggregators from the aggregation
result tree which should save a little memory and a few serialization
bytes. Perhaps more imporantly, this provides a mechanism by which we
can remove *all* pipelines from the aggregation result tree. This will
save quite a bit of space when pipelines are deep in the tree.

Sadly, doing this isn't simple because of backwards compatibility. Nodes
before 7.7.0 *need* those pipelines. We provide them by setting passing
a `Supplier<PipelineTree>` into the root of the aggregation tree that we
only call if we need to serialize to a version before 7.7.0.

This solution works for cross cluster search because we always reduce
the aggregations in each remote cluster and then forward them back to
the coordinating node. Its quite possible that the coordinating node
needs the pipeline (say it is version 7.1.0) and the gateway node in the
remote cluster doesn't (version 7.7.0). In that case the data nodes
won't send the pipeline aggregations back to the gateway node.
Critically, the gateway node *will* send the pipeline aggregations back
to the coordinating node. This is all managed with that
`Supplier<PipelineTree>`, but *how* it is managed is a bit tricky.
@nik9000
Copy link
Member Author

nik9000 commented Mar 25, 2020

Backport done! I'm going to leave the backport_pending label until I've reenabled bwc tests on master.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 25, 2020
Updates a few versions in serialization because we didn't make the 7.7.0
release train.
nik9000 added a commit that referenced this pull request Mar 25, 2020
Updates a few versions in serialization because we didn't make the 7.7.0
release train.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 26, 2020
This fixes pipeline aggregations used in cross cluster search from an older
version of Elasticsearch to a newer version of Elasticsearch. I broke
this in elastic#53730 when I was too aggressive in shutting off serialization
of pipeline aggs. In particular, this comes up when the coordinating
node is pre-7.8.0 and the gateway node is on or after 7.8.0.

The fix is another step down the line to remove pipeline aggregators
from the aggregation tree. Sort of. It create a new
`List<PipelineAggregator>` member in `InternalAggregation` *but* it is
only used for bwc serialization and it is fed by the mechanism
established in elastic#53730 to read the pipelines from the
nik9000 added a commit that referenced this pull request Mar 30, 2020
This fixes pipeline aggregations used in cross cluster search from an older
version of Elasticsearch to a newer version of Elasticsearch. I broke
this in #53730 when I was too aggressive in shutting off serialization
of pipeline aggs. In particular, this comes up when the coordinating
node is pre-7.8.0 and the gateway node is on or after 7.8.0.

The fix is another step down the line to remove pipeline aggregators
from the aggregation tree. Sort of. It create a new
`List<PipelineAggregator>` member in `InternalAggregation` *but* it is
only used for bwc serialization and it is fed by the mechanism
established in #53730 to read the pipelines from the
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 30, 2020
This fixes pipeline aggregations used in cross cluster search from an older
version of Elasticsearch to a newer version of Elasticsearch. I broke
this in elastic#53730 when I was too aggressive in shutting off serialization
of pipeline aggs. In particular, this comes up when the coordinating
node is pre-7.8.0 and the gateway node is on or after 7.8.0.

The fix is another step down the line to remove pipeline aggregators
from the aggregation tree. Sort of. It create a new
`List<PipelineAggregator>` member in `InternalAggregation` *but* it is
only used for bwc serialization and it is fed by the mechanism
established in elastic#53730 to read the pipelines from the
nik9000 added a commit that referenced this pull request Apr 2, 2020
This fixes pipeline aggregations used in cross cluster search from an older
version of Elasticsearch to a newer version of Elasticsearch. I broke
this in #53730 when I was too aggressive in shutting off serialization
of pipeline aggs. In particular, this comes up when the coordinating
node is pre-7.8.0 and the gateway node is on or after 7.8.0.

The fix is another step down the line to remove pipeline aggregators
from the aggregation tree. Sort of. It create a new
`List<PipelineAggregator>` member in `InternalAggregation` *but* it is
only used for bwc serialization and it is fed by the mechanism
established in #53730 to read the pipelines from the
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants