Investigate the Spike in Flaky test failures as a function of the gradle check configuration and Jenkins Runner instance sizing #321
Comments
We will try to create a new runner matching @nknize's own environment specs: 32/128 (cores/GB of memory), similar to m5.8xlarge. His desktop setup also means his CPU's single-core frequency is much higher than that of genuine Intel server CPUs, which needs to be taken into account as well. I will start investigating this next week. Thanks.
We have decided to test switching the default runner to m5.8xlarge next week.
New spec live. Monitoring a bit.
Closing this issue as the changes were completed.
Is your feature request related to a problem? Please describe
Coming out of this public slack discussion I'd like to explore a possible spike in flaky test failures during `gradlew check` on PRs in the OpenSearch core repository during regular business hours.
The concrete test failures we're noticing are similar to:
As can be seen in this one instance. This seems mostly related to socket issues in the runner and seems to occur on "aggressive" integration tests (e.g., those using `Scope.TEST` level, which fires up a new cluster for each test method).
With Jenkins having its own Runner for each invocation, I wouldn't expect the high level of activity (e.g., multiple PRs throughout the day) to contribute, so maybe this is more related to the test intensity, the `--parallel` gradle invocation, and the size of the Runner instance?
Describe the solution you'd like
As a parallel effort to trying to lean out the intense integration tests in the core repo, I'd like for us to see if we can root-cause these timeouts as a function of instance resources (e.g., CPU, memory) and the test configuration (e.g., number of concurrent integration tests, number of sockets).
It may be that we just aren't closing the sockets in the core IntegrationTest class (we can explore that separately).
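One way to frame the investigation described above is to tabulate how often timeouts occur for each combination of runner size and test parallelism. The following is only a minimal sketch under stated assumptions: the record layout, the runner labels (`small-runner`, `large-runner`), the worker counts, and the sample values are all hypothetical placeholders, not data from the OpenSearch CI.

```python
from collections import defaultdict

# Hypothetical CI records: (runner_label, parallel_workers, had_timeout_failure).
# Placeholder values for illustration only; real records would come from Jenkins build logs.
SAMPLE_RUNS = [
    ("small-runner", 8, True),
    ("small-runner", 8, False),
    ("small-runner", 4, False),
    ("large-runner", 8, False),
    ("large-runner", 8, False),
]


def timeout_rate_by_config(runs):
    """Return {(runner_label, workers): fraction of runs that hit a timeout}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for runner_label, workers, had_timeout in runs:
        key = (runner_label, workers)
        totals[key] += 1
        if had_timeout:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


if __name__ == "__main__":
    rates = timeout_rate_by_config(SAMPLE_RUNS)
    for (runner_label, workers), rate in sorted(rates.items()):
        print(f"{runner_label:>14} workers={workers:<3} timeout_rate={rate:.2f}")
```

If the rate drops with a larger instance at the same worker count, that would point toward resource contention; if it stays flat, a socket leak in the tests themselves becomes the more likely suspect.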
Describe alternatives you've considered
Additional context
Thank you!