Flaking Test: subpath failures in new-master-upgrade-cluster-new-parallel, other jobs #71383
Comments
The other |
adding to the milestone for triage since the flakes look new |
These 4 subpath tests all make the same call. Also, I will create a PR to fix the above. |
/priority critical-urgent |
@mkimuram while we should update the test with your PR, I'm not sure if it's going to fix the failure. Looking at one failing test log, these are the events I see on the failing pod:
So it seems to be failing before even reaching the subpath logic. |
Timeline of events: Test pod started getting processed by kubelet:
It took a minute after the test pod was created for the CSI driver to finish registering and the mount to succeed, and then for the subpath event to trigger:
So #71428 should address the issue. I was initially confused because the test logs didn't show the subpath event for some reason |
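As an aside, here is a rough client-go sketch for reconstructing this kind of timeline from a pod's events rather than from kubelet logs. The package, function name, and clientset wiring are assumptions for illustration, not part of the e2e framework:

```go
package eventdebug

import (
	"context"
	"fmt"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printPodEventTimeline lists the events recorded against a single pod and
// prints them in chronological order, which makes gaps such as a slow CSI
// driver registration (repeated MountVolume failures, then a late subpath
// event) easy to spot.
func printPodEventTimeline(ctx context.Context, cs kubernetes.Interface, namespace, podName string) error {
	events, err := cs.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=" + podName,
	})
	if err != nil {
		return err
	}
	sort.Slice(events.Items, func(i, j int) bool {
		return events.Items[i].FirstTimestamp.Before(&events.Items[j].FirstTimestamp)
	})
	for _, e := range events.Items {
		fmt.Printf("%s  %s  %s  %s\n",
			e.FirstTimestamp.Format("15:04:05"), e.Type, e.Reason, e.Message)
	}
	return nil
}
```

Running something like this against the test namespace while the pod is stuck should make the minute-long gap between pod creation and the mount success visible.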
I opened up #71433 to investigate whether 1 minute for CSI plugin registration is expected |
I've added a tracker issue for the complete list of storage flakes: #71434 This is so that we can track whether we're addressing most of the flakes with the subpath fixes, or if additional fixes are required. |
keeping open until flakes are verified cleaned up |
/reopen Reopening this issue until we can verify the fix in CI |
As of this morning we are seeing CSI "subpath should fail" flakes in the following job
@msau42 @saad-ali can you help us determine whether these cases are flaky tests that you are fixing, or whether there is an underlying CSI bug that needs addressing? |
@AishSundar the root cause of this is a delay in an external sidecar container and the tests not waiting long enough to accommodate that delay. There are no changes required in |
@msau42 @mkimuram looks like the same issue (#71433) is still being hit. See link:
And logs from
We need to make the tests handle this condition more gracefully. The suggestion here #71433 (comment) is one possible solution. |
Would be good to capture this as a known issue in release notes. |
@marpaia to follow up on capturing this in the release notes |
@mkimuram, per #71474 (review), can you create a targeted fix increasing the timeout? Looks like marking these tests as flaky will be difficult for 1.13, but if that doesn't fix it we will need to do that. |
Short term fix to bump the test timeout: #71483 |
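For context, the short-term fix amounts to raising how long the test is willing to wait for the expected pod event. A minimal sketch of that pattern is below; the constant, helper name, and the 5-minute value are illustrative assumptions, not the actual change in #71483:

```go
package subpathtest

import (
	"context"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// podEventTimeout is bumped well past the ~1 minute that CSI driver
// registration was observed to take, so the subpath failure event still has
// time to appear afterwards. The exact value here is illustrative.
const podEventTimeout = 5 * time.Minute

// waitForPodEventMessage polls the pod's events until one contains the
// expected message substring, or podEventTimeout expires.
func waitForPodEventMessage(cs kubernetes.Interface, namespace, podName, msgSubstring string) error {
	return wait.PollImmediate(2*time.Second, podEventTimeout, func() (bool, error) {
		events, err := cs.CoreV1().Events(namespace).List(context.TODO(), metav1.ListOptions{
			FieldSelector: "involvedObject.name=" + podName,
		})
		if err != nil {
			return false, err
		}
		for _, e := range events.Items {
			if strings.Contains(e.Message, msgSubstring) {
				return true, nil
			}
		}
		return false, nil
	})
}
```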
As per discussion in the release burndown meeting, this is not a 1.13 blocker. @jberkus please move this issue to 1.14 if there are no pending issues you are concerned about |
Ok, looks like the test still flaked. Looking at https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-slow/22043, the mount succeeded after a minute and proceeded to subpath processing, which (correctly) failed. However, for some reason, the test did not see the subpath event and still failed after 4 minutes. On the kubelet, subpath processing failed after a minute and generated the event:
But the test doesn't see it for some reason, and still times out after 4 more minutes:
|
I saw the same on https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/422. The CSI volume mount succeeded at:
Kubelet correctly failed the pod because of subpath
But the test still failed:
Because the test failed to find the pod event. This looks like a test issue. Not a 1.13 blocker. |
@mkimuram can you PTAL at this |
Is there a way to check if events are overloaded and not able to handle new events? The MountVolume retries while the driver is not registered are very frequent and could have overloaded the event server. |
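One rough way to look for that signal is to check whether the events for the namespace show signs of aggregation or rate limiting. The sketch below is an assumption-laden illustration: the count threshold is arbitrary, and the "combined from similar events" marker is the prefix client-go's event correlator adds when it aggregates spammy events, so it is worth verifying against the client-go version in use:

```go
package eventdebug

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reportEventPressure flags events that look like they were aggregated or
// rate-limited: a very high Count on a single Event, or the aggregation
// prefix in the message. Either suggests the MountVolume retry storm may have
// delayed or dropped later events such as the expected subpath failure.
func reportEventPressure(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	events, err := cs.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		if e.Count > 50 || strings.Contains(e.Message, "combined from similar events") {
			fmt.Printf("possible event flood: involvedObject=%s reason=%s count=%d\n",
				e.InvolvedObject.Name, e.Reason, e.Count)
		}
	}
	return nil
}
```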
Checking for the existence of an event seems like the wrong way to test this. Would checking pod status alone be insufficient? Ultimately that is what we really care about. |
Checking for the specific event guarantees that we are failing with the correct error. But perhaps events are not reliable enough to depend on. Checking if the Pod never came up running would probably reduce the flakes but could potentially mask false negatives. cc @justinsb for any thoughts. |
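For comparison, here is a rough sketch of what a status-based check could look like. The helper name and the treatment of waiting reasons are assumptions; it confirms the pod failed to start, but not that it failed for the subpath-specific reason:

```go
package subpathtest

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodStartFailure polls the pod's status instead of its events and
// returns once the pod has demonstrably failed to start: either the pod phase
// is Failed, or a container is stuck in a waiting state with an error reason.
// This avoids depending on the event stream, at the cost of not verifying the
// exact subpath error message.
func waitForPodStartFailure(cs kubernetes.Interface, namespace, podName string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := cs.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if pod.Status.Phase == corev1.PodFailed {
			return true, nil
		}
		for _, c := range pod.Status.ContainerStatuses {
			w := c.State.Waiting
			// "ContainerCreating" and "PodInitializing" are normal transient
			// states; any other waiting reason (e.g. CreateContainerConfigError)
			// is treated as a start failure here.
			if w != nil && w.Reason != "" && w.Reason != "ContainerCreating" && w.Reason != "PodInitializing" {
				return true, nil
			}
		}
		return false, nil
	})
}
```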
Tests are still flaking, and it seems like storage tests in the slow job have in general been flaking more since 11-21 or so: https://k8s-testgrid.appspot.com/sig-storage#gce-slow My theory is that the driver registration delay is causing a flood of events, which causes tests that check for events to fail. I think this is revealing multiple issues in both the tests and the product, and we should work on fixing them all, although any one of the fixes should mitigate the test flakes:
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Which jobs are failing: gce-new-master-upgrade-cluster-new-parallel
Which test(s) are failing:
Varies, but all subpath failures, including:
... and pretty much every other subpath test, but never all of them at once.
There are also a few other storage tests failing, such as:
Since when has it been failing: 11/22
Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new-parallel&include-filter-by-regex=.*CSI.*&width=20
Reason for failure:
These flakes started around the time that #71314 merged, but they don't line up with the exact merge timestamp, so the correlation is probably coincidental.
The subpath test failures seem to be mostly timeouts:
... so possibly this is just GCE fail.
Anything else we need to know:
This test job has always been flaky, with around a 40% failure rate.
/kind flake
/sig storage
/priority important-soon
cc
@saad-ali @AishSundar @liggitt