[BUG] Test timeout waiting for shards to rebalance #14556

Open
kkewwei opened this issue Jun 26, 2024 · 3 comments
Labels
bug Something isn't working Cluster Manager

Comments

@kkewwei
Contributor

kkewwei commented Jun 26, 2024

Describe the bug

In the flaky test ClusterRerouteIT.testDelayWithALargeAmountOfShards (#14510), the log clearly shows shards being rebalanced back and forth:
https://build.ci.opensearch.org/blue/rest/organizations/jenkins/pipelines/gradle-check/runs/41527/nodes/18/steps/32/log/?start=0

Timeline from the log:

2024-06-22T23:45:36,760: node_t0 is shut down.
2024-06-22T23:45:37,042: node_t3 is elected as new cluster manager.
2024-06-22T23:45:49,898: all the indices are green.
2024-06-22T23:45:49,898 ~ 2024-06-22T23:47:39,526: [test8][4] and [test4][4] are rebalanced back and forth.
2024-06-22T23:47:39,526: ensureGreen timed out.
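
To make the timeline easier to reason about, here is a small sketch that computes the intervals between the events listed above. The timestamps are copied from the list; the event labels (`node_t0_shutdown` and so on) are informal names chosen for this example, not identifiers from the log.

```python
from datetime import datetime

# Timestamps copied from the timeline above. The dictionary keys are
# informal labels made up for this example.
FMT = "%Y-%m-%dT%H:%M:%S,%f"
events = {
    "node_t0_shutdown": "2024-06-22T23:45:36,760",
    "node_t3_elected": "2024-06-22T23:45:37,042",
    "indices_green": "2024-06-22T23:45:49,898",
    "ensure_green_timeout": "2024-06-22T23:47:39,526",
}
parsed = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two labelled events."""
    return (parsed[end] - parsed[start]).total_seconds()


recovery_s = seconds_between("node_t0_shutdown", "indices_green")
rebalance_s = seconds_between("indices_green", "ensure_green_timeout")

print(f"shutdown -> green: {recovery_s:.3f} s")
print(f"green -> timeout:  {rebalance_s:.3f} s")
```

By this arithmetic the cluster went green about 13 seconds after the node shut down, and the two shards then kept relocating for roughly 110 seconds until the timeout.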

This seems to be a very low-probability bug. I tried to reproduce it several times without success, so I am opening this issue to track it.

Related component

Cluster Manager


@kkewwei added the bug and untriaged labels on Jun 26, 2024
@andrross changed the title from "[BUG] the shards are rebalanced back and forth" to "[BUG] Test timeout waiting for shards to rebalance" on Jun 26, 2024
@andrross
Member

My understanding is that ClusterRerouteIT.testDelayWithALargeAmountOfShards creates a lot of shards and takes down a node. The cluster turns green at the end of the test, but shard placement isn't optimal, so the cluster continues rebalancing towards the optimal state, and sometimes the test times out waiting for shard rebalancing to stop. @kkewwei Is that right?
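
The failure mode described above can be sketched as a generic "poll until settled or time out" loop. This is only an illustration of the pattern, not the OpenSearch test framework's actual code: `wait_until`, `FakeClock`, `run`, the 30-second timeout, and the relocating-shard counts are all hypothetical names and values made up for the example.

```python
import itertools
import time
from typing import Callable, Iterable


def wait_until(condition: Callable[[], bool], timeout_s: float,
               poll_interval_s: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it holds (True) or `timeout_s` elapses (False)."""
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


class FakeClock:
    """Deterministic clock so the demo does not actually sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def run(relocating_counts: Iterable[int], timeout_s: float = 30.0) -> bool:
    """Wait until the (synthetic) number of relocating shards reaches zero."""
    fake = FakeClock()
    counts = iter(relocating_counts)
    return wait_until(lambda: next(counts) == 0, timeout_s,
                      clock=fake.time, sleep=fake.sleep)


# Rebalancing that converges: the wait succeeds.
settled = run([4, 2, 1, 0])
# Rebalancing that oscillates and never reaches zero: the wait times out.
stuck = run(itertools.cycle([1, 2]))

print(f"converging: {settled}, oscillating: {stuck}")
```

Under these assumptions, a cluster that is green but never stops relocating shards looks identical to a hung test from the waiter's point of view, which matches the timeout seen here.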

The question here is whether this is a more general pattern that is causing flakiness in other test cases.

@kkewwei
Contributor Author

kkewwei commented Jun 30, 2024

@andrross, yes. The strange thing is that only two shards are rebalanced back and forth, and this goes on for a long time.
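
For anyone trying to spot this pattern in other failing runs, here is a rough sketch of how back-and-forth moves could be detected once relocation events have been extracted from a log. The `Relocation` record, the `find_ping_pong` helper, the node names `node_a`/`node_b`/`node_c`, and the event sequence are all made up for illustration; they are not taken from the CI log, and only the two shard names come from this issue.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Relocation(NamedTuple):
    """One shard move. Hypothetical record, not an OpenSearch type."""
    shard: str
    source: str
    target: str


def find_ping_pong(relocations: Iterable[Relocation],
                   min_reversals: int = 2) -> Dict[str, int]:
    """Count, per shard, moves that exactly undo that shard's previous move."""
    last_move: Dict[str, Relocation] = {}
    reversals: Dict[str, int] = defaultdict(int)
    for move in relocations:
        prev = last_move.get(move.shard)
        if prev is not None and (prev.source, prev.target) == (move.target, move.source):
            reversals[move.shard] += 1
        last_move[move.shard] = move
    return {shard: n for shard, n in reversals.items() if n >= min_reversals}


# Synthetic event sequence (node names are invented for this example).
moves = [
    Relocation("[test8][4]", "node_a", "node_b"),
    Relocation("[test4][4]", "node_b", "node_a"),
    Relocation("[test1][0]", "node_a", "node_c"),  # moves once, never reversed
    Relocation("[test8][4]", "node_b", "node_a"),
    Relocation("[test4][4]", "node_a", "node_b"),
    Relocation("[test8][4]", "node_a", "node_b"),
    Relocation("[test4][4]", "node_b", "node_a"),
]

ping_pong = find_ping_pong(moves)
print(ping_pong)
```

The `min_reversals` threshold is an arbitrary choice to ignore a single legitimate move back; a real analysis would need to parse the actual allocation log lines first.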

@rwali-aws

[Triage - attendees 1 2 3 4 5 6]
