DataTiersUsageTransportAction is incredibly inefficient in large clusters #100230

Closed
@DaveCTurner

Today DataTiersUsageTransportAction executes an internal nodes stats action with all the trimmings:

    client.admin()
        .cluster()
        .prepareNodesStats()
        .all()
        .setIndices(CommonStatsFlags.ALL)

In a large cluster this implementation may need hundreds of MiB of heap on the coordinating node to hold onto every statistic about every shard on every node (several KiB per shard), even though we use almost none of them. Worse, the coordinating node is always the elected master, because that's how XPackUsageFeatureTransportAction derivatives work. It also burns a bunch of CPU and network bandwidth just transporting these stats around the cluster. AFAICT we could push this computation out to the individual nodes with a dedicated TransportNodesAction which computes the tiny TierSpecificStats on each node in a manner that allows the coordinating node to combine them.
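To illustrate the shape of that idea, here is a minimal standalone sketch of "summarize on each node, combine on the coordinator". It is not Elasticsearch code: `TierStatsSketch`, `TierSummary`, `LocalShard`, `summarizeNode` and `combine` are all hypothetical names, and the three counters are assumed stand-ins for whatever TierSpecificStats actually carries. It only covers values that merge by addition; any statistic that does not (a median, say) would need a different mergeable representation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these types are Elasticsearch classes.
final class TierStatsSketch {

    /** Small, mergeable per-tier summary that a single node could report. */
    record TierSummary(long shardCount, long docCount, long sizeInBytes) {
        TierSummary plus(TierSummary other) {
            return new TierSummary(
                shardCount + other.shardCount,
                docCount + other.docCount,
                sizeInBytes + other.sizeInBytes
            );
        }
    }

    /** Assumed shape of one locally held shard: its tier plus the numbers we need. */
    record LocalShard(String tier, long docCount, long sizeInBytes) {}

    /** Would run on each data node: reduce local shards to one summary per tier. */
    static Map<String, TierSummary> summarizeNode(List<LocalShard> shards) {
        Map<String, TierSummary> perTier = new TreeMap<>();
        for (LocalShard shard : shards) {
            TierSummary one = new TierSummary(1, shard.docCount(), shard.sizeInBytes());
            perTier.merge(shard.tier(), one, TierSummary::plus);
        }
        return perTier;
    }

    /** Would run on the coordinating node: add up the per-node summaries. */
    static Map<String, TierSummary> combine(List<Map<String, TierSummary>> nodeResponses) {
        Map<String, TierSummary> total = new TreeMap<>();
        for (Map<String, TierSummary> response : nodeResponses) {
            response.forEach((tier, summary) -> total.merge(tier, summary, TierSummary::plus));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two made-up nodes with made-up shard sizes.
        Map<String, TierSummary> nodeA = summarizeNode(List.of(
            new LocalShard("data_hot", 10, 100),
            new LocalShard("data_hot", 5, 50),
            new LocalShard("data_warm", 1, 7)));
        Map<String, TierSummary> nodeB = summarizeNode(List.of(
            new LocalShard("data_hot", 2, 20)));
        System.out.println(combine(List.of(nodeA, nodeB)));
    }
}
```

With this shape each node response is a handful of longs per tier rather than several KiB per shard, so what the coordinating node holds scales with nodes times tiers instead of with the total shard count.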

It also does not propagate cancellation down to the nodes stats task (addressed in #100253).

It also captures the cluster state when it's initiated and retains it until completion, which can represent another 100MiB+ of heap usage.

Relates #77466.
