WIP: Kubernetes v1.16 update #424
Conversation
Force-pushed from 8edb21f to aa4ec3c
One more strange failure :-( :
Got it, here is the issue:
Please submit the Kubernetes fix upstream and mark this PR as WIP as long as it uses the forked Kubernetes.
Submitted the fix to upstream: kubernetes/kubernetes#83609
Amarnath Valluri <notifications@github.com> writes:
> @pohly So, how should we deal with this PR? Shall we wait for the issue to be fixed upstream? But moving to the v1.16-based test code is needed for testing both #216 and #380.
The PR got a favorable review, so we can be confident that it'll be
accepted and therefore we can use your temporary fork to move ahead with
this PR.
pohly left a comment
pmem-csi-driver: ListVolumes pagination should reflect intermediate changes
We keep an array copy of the current volumes at the initial ListVolumes call, and
keep updating (appending) the list with newly added volumes.
This seems to be based on the assumption that at most one client is using ListVolumes and that therefore keeping information updated for that one client is okay. But that assumption cannot be made.
The CSI spec explicitly says that concurrent volume changes lead to inconsistent results in a client which uses paging: it may miss volumes (a volume was deleted which causes an unseen volume to land on a page which was retrieved earlier) or get duplicates (a volume is added before an already retrieved volume, which then gets reported twice).
Therefore a simpler approach where the token is an integer offset into the set of volumes should be good enough.
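A minimal sketch of such an offset-based `ListVolumes` (type, field, and helper names here are hypothetical, not the actual PMEM-CSI code):

```go
// Sketch only: controllerServer and snapshotVolumes stand in for the real driver.
package pmemcsidriver

import (
	"context"
	"sort"
	"strconv"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type controllerServer struct {
	// snapshotVolumes returns the driver's current volumes.
	snapshotVolumes func() []*csi.Volume
}

// ListVolumes treats the starting token as a plain integer offset into a
// freshly sorted snapshot of the current volumes. Pages can still miss or
// duplicate volumes when volumes change concurrently, which the CSI spec
// explicitly allows.
func (cs *controllerServer) ListVolumes(ctx context.Context, req *csi.ListVolumesRequest) (*csi.ListVolumesResponse, error) {
	vols := cs.snapshotVolumes()
	sort.Slice(vols, func(i, j int) bool { return vols[i].VolumeId < vols[j].VolumeId })

	offset := 0
	if req.StartingToken != "" {
		o, err := strconv.Atoi(req.StartingToken)
		if err != nil || o < 0 {
			// The CSI spec asks for ABORTED when the token is not understood.
			return nil, status.Error(codes.Aborted, "invalid starting_token")
		}
		offset = o
	}
	if offset > len(vols) {
		offset = len(vols)
	}

	end := len(vols)
	if req.MaxEntries > 0 && offset+int(req.MaxEntries) < end {
		end = offset + int(req.MaxEntries)
	}

	entries := make([]*csi.ListVolumesResponse_Entry, 0, end-offset)
	for _, v := range vols[offset:end] {
		entries = append(entries, &csi.ListVolumesResponse_Entry{Volume: v})
	}

	nextToken := ""
	if end < len(vols) {
		nextToken = strconv.Itoa(end)
	}
	return &csi.ListVolumesResponse{Entries: entries, NextToken: nextToken}, nil
}
```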
But the logic should work (with a few corner cases which result in an inconsistent list) even for calls coming from multiple clients, as we only depend on `startingIndex` set by the caller. And as we never remove elements from the copy, we just extract elements from that index.
Amarnath Valluri <notifications@github.com> writes:
>> This seems to be based on the assumption that at most one client is using `ListVolumes` and that therefore keeping information updated for that one client is okay. But that assumption cannot be made.
> But the logic should work (with a few corner cases which result in an inconsistent list) even for calls coming from multiple clients, as we only depend on `startingIndex` set by the caller. And as we never remove elements from the copy, we just extract elements from that index.
The result is still going to be inconsistent. There's no way around
that. Therefore a simpler implementation is sufficient.
There is a test case for this: https://github.com/kubernetes-csi/csi-test/blob/82b05190c167c52bb6d5aaf2e1d7c833fa539783/pkg/sanity/controller.go#L200
Amarnath Valluri <notifications@github.com> writes:
>> The result is still going to be inconsistent. There's no way around that. Therefore a simpler implementation is sufficient.
> There is a test case for this:
> https://github.com/kubernetes-csi/csi-test/blob/82b05190c167c52bb6d5aaf2e1d7c833fa539783/pkg/sanity/controller.go#L200
Yes, I know. I've had that same debate already with other developers who
looked at those tests and wanted to implement ListVolumes such that it
always returns "correct" results.
Speaking of csi-sanity, would the implementation that I suggested fail
those tests?
I just tried your branch without the last commit, i.e. without changes to the ListVolumes implementation, and it passed these tests. Because of the commit in this PR and your comment I thought PMEM-CSI would fail them and that therefore you made these changes, but that doesn't seem to be the case?
No, I worked on this change just because I ran into the test failure. Let me try running the tests again.
Here is the failure I got: […]
By the way, the test case results in a leftover volume, for which I will submit a fix.
I'm not hitting that problem. I am on f6bf5a4, which is your branch minus that last commit. How does it fail for you? Does it also fail when running just that test? Perhaps it somehow interacts with other tests (leaked volumes)?
What I meant is: what is the assertion that gets triggered? A full log of the failure may help. For reference, here's the output that I get:
This failure was the result of manually deleting that leftover volume, so it is not related to this specific test failure. But after a closer look at the test case, it's passing just because the final volume count after the deletion and the addition of a new volume (3 - 2 + 1 = 2) happens to match `maxEntries`/`startingIndex`. The test fails by just altering the test case's initial values:
`minVolCount = 4` // from 3
`maxEntries = 3` // from 2
Amarnath Valluri <notifications@github.com> writes:
> This failure was the result of manually deleting that leftover volume, so it is not related to this specific test failure.
I still don't understand. How did you run into this problem if the test
passes as it exists now?
> But after a closer look at the test case, it's passing just because the final volume count after the deletion and the addition of a new volume (3 - 2 + 1 = 2) happens to match `maxEntries`/`startingIndex`.
> The test fails by just altering the test case's initial values:
> `minVolCount = 4` // from 3
> `maxEntries = 3` // from 2
So in other words, the test is completely bogus and just passes
accidentally? Then let's remove it.
For reference, others have been unhappy about it before:
kubernetes-csi/csi-test#205
I was involved as a reviewer, but I didn't write that test; obviously I
hadn't thought about it enough to catch the issue during review :-/
The failure was at DeleteVolume() for the manually deleted leftover volume.
In my opinion, the test case is valid: it makes sense to expect to see the volumes created during pagination, but not the volumes deleted in the meantime. I also like the idea of "timestamp"-based listing suggested by @Akrog in kubernetes-csi/csi-test#205.
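For illustration, one possible reading of timestamp-based listing (a sketch only, not necessarily what kubernetes-csi/csi-test#205 proposes, with made-up names): sort by creation time and let the token be the creation time of the last reported volume, so deletions don't shift offsets and newly created volumes show up on later pages.

```go
package pmemcsidriver

import (
	"sort"
	"time"
)

type volumeRecord struct {
	id      string
	created time.Time
}

// nextPage returns up to maxEntries volumes created after the token time,
// plus the token for the following page (a zero "after" means "start from
// the beginning", a zero return token means "no more pages").
func nextPage(all []volumeRecord, after time.Time, maxEntries int) ([]volumeRecord, time.Time) {
	var eligible []volumeRecord
	for _, v := range all {
		if v.created.After(after) {
			eligible = append(eligible, v)
		}
	}
	sort.Slice(eligible, func(i, j int) bool { return eligible[i].created.Before(eligible[j].created) })

	if maxEntries <= 0 || maxEntries > len(eligible) {
		maxEntries = len(eligible)
	}
	page := eligible[:maxEntries]

	var next time.Time
	if maxEntries < len(eligible) {
		next = page[len(page)-1].created
	}
	return page, next
}
```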
But it's not required by the spec, and thus not a valid test. If a valid implementation fails the test, the test needs to be removed.
@avalluri once you remove that last commit, we can merge this, right?
Before we merge, we need to check whether we still run the desired checks. For example, https://github.com/jsafrane/kubernetes/blob/a9e838f2e9823947bb9a5b7c5fccd9b359b4a799/test/e2e/storage/testsuites/volumes.go has replaced many of the provisioning tests, so we need to add that new suite and potentially adapt our tests.
After enabling the Volumes testsuite and adding […], this issue got fixed: avalluri@bd357fb. But the Volumes testsuite depends on the host kubectl for running tests, so I would suggest dealing with this in a separate PR. @pohly, would you accept this?
@avalluri: can you cherry-pick this commit into your branch?
There are some API breakages in the test framework and the Kubernetes mount package; the fixes are in the following commits. This change also updates the Go modules by running: `dep ensure -update`
With the Kubernetes v1.16 update, the DeviceOpened() method moved from the 'util/mount' package to 'volume/util/hostutil'.
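For illustration, the adapted call site looks roughly like this (a sketch; the function and variable names are made up):

```go
package pmemcsidriver

import (
	"fmt"

	"k8s.io/kubernetes/pkg/volume/util/hostutil"
)

// checkDeviceFree verifies that nobody holds the device open. The check used
// to be reachable through the old 'util/mount' interface; with v1.16 it is
// provided by the hostutil package.
func checkDeviceFree(devicePath string) error {
	hu := hostutil.NewHostUtil()
	inUse, err := hu.DeviceOpened(devicePath)
	if err != nil {
		return err
	}
	if inUse {
		return fmt.Errorf("device %s is still in use", devicePath)
	}
	return nil
}
```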
Go 1.13 enables TLS 1.3 by default. When that is active, the server seems to get less precise error messages from gRPC. Both of these were seen:
rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: failed to write client preface: write unix @->pmem-registry.sock: write: broken pipe"
rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection closed
It's okay not to know exactly what the problem is, as long as the connection isn't established.
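One way for tests to stay robust against that, sketched with a made-up helper name: assert only the gRPC status code and ignore the message text.

```go
package pmemcsidriver

import (
	"testing"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// requireUnavailable accepts any failure reported as codes.Unavailable and
// deliberately ignores the message text, which differs between TLS 1.2
// ("broken pipe") and TLS 1.3 ("connection closed").
func requireUnavailable(t *testing.T, err error) {
	t.Helper()
	if err == nil {
		t.Fatal("expected the connection attempt to fail")
	}
	if s, ok := status.FromError(err); !ok || s.Code() != codes.Unavailable {
		t.Fatalf("expected gRPC code Unavailable, got: %v", err)
	}
}
```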
With Go 1.13, the "testing" flags somehow aren't registered in an init() function anymore. As a result, "go test ./test/e2e" fails because it passes a "-test" parameter that isn't registered yet when init() in e2e_test.go parses parameters. The solution is to make test suite initialization a bit closer to what Go documents as the right approach for parameter handling, i.e. doing it in a TestMain function (https://golang.org/pkg/testing/#hdr-Main).
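A minimal sketch of that TestMain approach; the flag-registration helper is a hypothetical stand-in for whatever the e2e suite needs:

```go
package e2e

import (
	"flag"
	"os"
	"testing"
)

func TestMain(m *testing.M) {
	// By the time TestMain runs, the generated test main has already called
	// testing.Init(), so the -test.* flags exist and flag.Parse() succeeds.
	registerSuiteFlags()
	flag.Parse()
	os.Exit(m.Run())
}

func registerSuiteFlags() {
	// Placeholder: register the suite's own flags here before parsing.
}
```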
Kubernetes v1.16 made some changes (cleanup, reshuffling) to the e2e test framework API that require changes to the e2e test code.
The IDGen interface was added in the recent csi-test sanity upgrade to 2.2.0.
Upstream change 04300826fd0d719bb166905bb6c8a3286a171465 introduced an "e2e/common" package dependency in the storage testsuite. This resulted in running all the unneeded common tests while running the storage tests in pmem-csi. This is an intermediate step until the fix is backported to the 1.16 branch: kubernetes/kubernetes#83776
Our local changes are in Kubernetes 1.16; therefore we can remove the locally modified copy.
Do you mean a separate PR against upstream Kubernetes and a local copy of that change in your fork of Kubernetes 1.16? @okartau was working on removing the kubectl dependency. I don't know whether that is ready to be upstreamed.
This would be one way to deal with it, but in my opinion enabling the Volumes testsuite is not related or specific to the Kubernetes 1.16 update. We can handle this separately.
It is, because without enabling it, our test coverage regresses.
Probably the easier solution.
I filed kubernetes/kubernetes#83913 for the kubectl dependency.
I couldn't understand this: how could it regress a testsuite which was not enabled before?
The regression is that we run fewer tests than before. That's because important tests were moved out of the testsuite that we have enabled into the one that we don't have enabled: kubernetes/kubernetes@a9e838f#diff-6dc833c65b9c544adfb2d654d2f7cbd3 In particular, "should provision storage with defaults" is gone if we don't enable the volumes testsuite.
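For reference, a sketch of what the enabled suite list could look like, following the upstream csi_volumes.go pattern (the actual list in our e2e code may differ):

```go
package e2e

import "k8s.io/kubernetes/test/e2e/storage/testsuites"

// csiTestSuites lists the upstream storage suites run against the driver.
// Adding InitVolumesTestSuite restores the checks that moved out of the
// provisioning suite in v1.16, e.g. "should provision storage with defaults".
var csiTestSuites = []func() testsuites.TestSuite{
	testsuites.InitVolumesTestSuite,
	testsuites.InitProvisioningTestSuite,
	testsuites.InitVolumeModeTestSuite,
}
```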
OK, got it. Thanks for the information. Now that #432 has been merged, I will close this PR.