
Move data tier usage calculation to node level (#100230) #101128


Closed

Conversation

gmarouli
Contributor

Current situation
DataTiersUsageTransportAction executes an internal nodes stats action with all the trimmings:

 client.admin() 
     .cluster() 
     .prepareNodesStats() 
     .all() 
     .setIndices(CommonStatsFlags.ALL) 

This puts a lot of memory pressure on the coordinating node (which in this case is always the elected master) and can cause further instability.

Proposed solution

We could trim down the data we request, since we only care about the docs and store stats per shard; that would reduce what needs to be kept in memory. However, in line with the optimisations of the many-shards project, we would like to make it even more lightweight.
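For illustration only, a trimmed-down request might look roughly like the sketch below. The flag names are assumptions based on the existing CommonStatsFlags API, and this is not the approach this PR ultimately takes:

    // Hypothetical sketch: request only docs and store stats instead of everything.
    // Flag names assume the existing CommonStatsFlags API; this PR does not go this route.
    client.admin()
        .cluster()
        .prepareNodesStats()
        .clear() // drop the node-level sections that .all() would request
        .setIndices(new CommonStatsFlags(CommonStatsFlags.Flag.Docs, CommonStatsFlags.Flag.Store))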

We chose to do this and, in addition, to push part of the calculation to the nodes themselves. Namely, each node sends the following to the elected master, grouped by preferred tier:

  • A set of all indices residing on the node, so they can be counted.
  • A list of the sizes of all primary shards residing on the node, so statistics can be calculated on them.
  • The count of total shards on the node.
  • The count of total docs on the node.
  • The total store size.

The elected master then collects the per-node data and aggregates it into a single response.
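For illustration only, here is a minimal sketch (with hypothetical names, not the actual classes introduced by this PR) of the per-tier summary a node could send and how the master could merge two such summaries:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical per-node, per-preferred-tier summary; names are illustrative only.
    record TierShardSummary(
        Set<String> indexNames,        // indices with shards on this node, used for counting
        List<Long> primaryShardSizes,  // per-primary store sizes, used for statistics
        int totalShardCount,           // total shards on this node for this tier
        long totalDocCount,            // total docs across those shards
        long totalStoreSizeInBytes     // total store size across those shards
    ) {
        // Sketch of the aggregation the elected master would perform per tier,
        // merging the summaries it receives from each node into one response.
        static TierShardSummary merge(TierShardSummary a, TierShardSummary b) {
            Set<String> indices = new HashSet<>(a.indexNames());
            indices.addAll(b.indexNames());
            List<Long> sizes = new ArrayList<>(a.primaryShardSizes());
            sizes.addAll(b.primaryShardSizes());
            return new TierShardSummary(
                indices,
                sizes,
                a.totalShardCount() + b.totalShardCount(),
                a.totalDocCount() + b.totalDocCount(),
                a.totalStoreSizeInBytes() + b.totalStoreSizeInBytes()
            );
        }
    }

Shipping only these aggregates keeps the payload proportional to the number of indices and primary shards on each node, rather than a full CommonStats object per shard held on the coordinating node.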

Fixes: #100230

@gmarouli gmarouli added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Oct 19, 2023
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team v8.12.0 labels Oct 19, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @gmarouli, I've created a changelog YAML for you.

@gmarouli gmarouli marked this pull request as draft October 19, 2023 13:24
@gmarouli
Contributor Author

Reopened: #47875

@gmarouli
Contributor Author

@elasticmachine run elasticsearch-ci/part-2

@gmarouli
Contributor Author

@elasticmachine update branch

@gmarouli
Contributor Author

@elasticmachine run elasticsearch-ci/part-2

@gmarouli gmarouli added the buildkite-opt-in Opts your PR into Buildkite instead of Jenkins label Oct 31, 2023
@gmarouli
Contributor Author

@elasticmachine update branch

@elasticsearchmachine
Collaborator

Hi @gmarouli, I've updated the changelog YAML for you.

@gmarouli
Contributor Author

I am switching to Buildkite so I can re-run the failing test more quickly.

I haven't been able to find the connection between the failing test and this code, and I cannot reproduce it locally. But it appears to fail consistently only in this PR, so there must be something I am missing.

@gmarouli
Contributor Author

I cannot figure out why the test is failing. I opened another PR in which I am going to apply the changes step by step and see which one triggers the test failure.

@gmarouli
Contributor Author

Also, I was just able to reproduce the failure when the test runs as part of the whole suite, but not as an individual test; I will follow up on this too.

@gmarouli
Contributor Author

gmarouli commented Nov 1, 2023

Duplicate of #101599

@gmarouli gmarouli marked this as a duplicate of #101599 Nov 1, 2023
Labels
  • >bug
  • buildkite-opt-in (Opts your PR into Buildkite instead of Jenkins)
  • :Data Management/ILM+SLM (Index and Snapshot lifecycle management)
  • Team:Data Management (Meta label for data/management team)
  • v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataTiersUsageTransportAction is incredibly inefficient in large clusters
3 participants