[BUG] Test timeout waiting for shards to rebalance #14556

kkewwei · 2024-06-26T09:29:35Z

Describe the bug

In the flaky test ClusterRerouteIT.testDelayWithALargeAmountOfShards (#14510), we can clearly see that the shards are rebalanced back and forth:
https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/gradle-check/runs/41527/nodes/18/steps/32/log/?start=0

The log shows the following sequence:

2024-06-22T23:45:36,760: node_t0 is shut down.
2024-06-22T23:45:37,042: node_t3 is elected as new cluster manager.
2024-06-22T23:45:49,898: all the indices are green.
2024-06-22T23:45:49,898 ~ 2024-06-22T23:47:39,526: [test8][4] and [test4][4] are rebalanced back and forth.
2024-06-22T23:47:39,526: ensureGreen timed out.
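One way to confirm this kind of back-and-forth movement is to pull the relocation events out of the log and count how often a move undoes the previous move of the same shard. The following is only a minimal sketch: the function name `find_oscillating_shards`, the `(shard, from_node, to_node)` tuple format, and the node names in the sample data are all hypothetical. The CI log does not come in this shape, and this report does not say which nodes the two shards moved between.

```python
from collections import defaultdict


def find_oscillating_shards(moves, min_reversals=2):
    """Return shards whose relocations repeatedly undo the previous one.

    `moves` is a chronological list of (shard, from_node, to_node) tuples.
    This tuple format is an assumption made for illustration only.
    """
    last_move = {}                # shard -> (from_node, to_node) of its previous move
    reversals = defaultdict(int)  # shard -> number of moves that undid the previous move
    for shard, src, dst in moves:
        # A reversal is a move B -> A immediately after a move A -> B.
        if last_move.get(shard) == (dst, src):
            reversals[shard] += 1
        last_move[shard] = (src, dst)
    return sorted(s for s, n in reversals.items() if n >= min_reversals)


# Made-up sample data: node names and move order are illustrative only.
moves = [
    ("[test8][4]", "node_a", "node_b"),
    ("[test4][4]", "node_b", "node_a"),
    ("[test8][4]", "node_b", "node_a"),
    ("[test4][4]", "node_a", "node_b"),
    ("[test8][4]", "node_a", "node_b"),
    ("[test4][4]", "node_b", "node_a"),
    ("[test1][0]", "node_a", "node_c"),
]

print(find_oscillating_shards(moves))
```

A shard that relocates once (like `[test1][0]` in the sample) is ignored; only shards whose moves keep reversing each other are reported.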

This appears to be a very low-probability bug. I tried several times to reproduce it but failed, so I am opening this issue to track it.

Related component

Cluster Manager



andrross · 2024-06-26T15:55:02Z

My understanding is that ClusterRerouteIT.testDelayWithALargeAmountOfShards creates a lot of shards and takes down a node. The cluster turns green by the end of the test, but shard placement is not yet optimal, so the cluster keeps rebalancing toward the optimal state, and sometimes the test times out waiting for shard rebalancing to stop. @kkewwei Is that right?

The question here is whether this is a more general pattern that is causing flakiness in other test cases.

kkewwei · 2024-06-30T02:05:04Z

@andrross, yes. The strange thing is that only two shards are rebalanced back and forth, and this goes on for a long time.

rwali-aws · 2024-07-11T06:52:04Z

[Triage - attendees 1 2 3 4 5 6]

kkewwei added the bug and untriaged labels Jun 26, 2024

github-actions bot added the Cluster Manager label Jun 26, 2024

andrross changed the title ~~[BUG] the shards are rebalanced back and forth~~ [BUG] Test timeout waiting for shards to rebalance Jun 26, 2024

rwali-aws removed the untriaged label Jul 4, 2024
