Skip to content
This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

Devicemanager: New error codes + Device Manager unit tests and few other fixes. #420

Merged
merged 7 commits into from
Nov 22, 2019

Conversation

avalluri
Copy link
Contributor

@avalluri avalluri commented Sep 25, 2019

This PR address a couple of issues:

  • Defined new error codes that are returned by DeviceManager interface methods. Made changes to implementations to use these error codes and revised the error messages using error formatting features provided by go-1.13 fmt package.

  • Discussion on issue cache data in LVM device manager? #413, raised a requirement that PMEM-CSI should handle the DeleteVolume() call for volumes that the underlined device already deleted manually by the admin. So for such a volume we just ignore and return no error in DeviceManager.DeleteVolume(). This also makes DeleteVolume() idempotent.

  • Issue DeleteVolume response is incorrect if volume is in use #402 : CSI.DeleteVolume should return FailedPreCondition error in case of the volume is in use. Test case added for the same.

  • Unit tests for device manager

  • This PR also contains a commit that removes the unused code in DeviceManager.

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests for this, both for direct mode and LVM mode?

@avalluri
Copy link
Contributor Author

avalluri commented Oct 9, 2019

Can you add tests for this, both for direct mode and LVM mode?

Tests for device manager is bit challenging, as they deal with devices. I stared with LVM tests: avalluri@c7437e5 .
The ugly part is how we run the test, as it needs sudo permissions. I tried running test inside the container and ended up like this:avalluri@c7437e5#diff-436cc358f7f160c8242374b3f820867d

@avalluri avalluri force-pushed the devicemanager branch 5 times, most recently from 1bc573f to 3cbca7b Compare October 23, 2019 12:49
@avalluri
Copy link
Contributor Author

Can you add tests for this, both for direct mode and LVM mode?

@pohly, I added unit tests for LVM, but I am not sure yet about covering ndctl manager, which needs a bit thinking how to mimic the pmem device.

@@ -0,0 +1,229 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name of that file seems misspelled, should be pmem-lvm_test.go ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should be pmd-lvm_test.go. I fixed it now.

vendor-bom.csv Outdated Show resolved Hide resolved
@avalluri
Copy link
Contributor Author

avalluri commented Oct 24, 2019

Device manager tests failed in CI while setting up loop device:

[2019-10-24T07:52:16.650Z] DeviceManage LVM
[2019-10-24T07:52:16.650Z] /go/src/github.com/intel/pmem-csi/pkg/pmem-device-manager/pmd-lvm_test.go:35
[2019-10-24T07:52:16.650Z]   LVM [BeforeEach]
[2019-10-24T07:52:16.650Z]   /go/src/github.com/intel/pmem-csi/pkg/pmem-device-manager/pmd-lvm_test.go:54
[2019-10-24T07:52:16.650Z]     Should create a new device
[2019-10-24T07:52:16.650Z]     /go/src/github.com/intel/pmem-csi/pkg/pmem-device-manager/pmd-lvm_test.go:55
[2019-10-24T07:52:16.650Z] 
[2019-10-24T07:52:16.650Z]     Failed to create volume group
[2019-10-24T07:52:16.650Z]     Expected
[2019-10-24T07:52:16.650Z]         <*errors.errorString | 0xc000021700>: {
[2019-10-24T07:52:16.650Z]             s: "could not open /dev/loop-control: open /dev/loop-control: no such file or directory",
[2019-10-24T07:52:16.650Z]         }
[2019-10-24T07:52:16.650Z]     to be nil
[2019-10-24T07:52:16.650Z] 
[2019-10-24T07:52:16.650Z]     /go/src/github.com/intel/pmem-csi/pkg/pmem-device-manager/pmd-lvm_test.go:42

@avalluri avalluri mentioned this pull request Oct 28, 2019
Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in #383 (review), please also revise error reporting and checking such that the error code returned by ns.cs.dm.GetDevice in nodeserver.go is checked.

@avalluri avalluri force-pushed the devicemanager branch 4 times, most recently from 5aa343e to 127e7ac Compare October 30, 2019 11:48
Makefile Outdated
@@ -13,6 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

GO=GOOS=linux GO111MODULE=off go
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why GOOS=linux?

Copy link
Contributor Author

@avalluri avalluri Oct 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why GOOS=linux?

This is not added part of this change. It was existing before, I just moved from that place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We weren't consistent then. I was looking at the first changed line and that only had:

build: $(CMDS) $(TEST_CMDS)
	go test -run none ./pkg/... ./test/e2e

@avalluri avalluri requested a review from pohly October 31, 2019 07:31
@avalluri avalluri changed the title Devicemanager: Return no error if the device not found while deleting Devicemanager: New error codes + LVM unit test and few other fixes. Oct 31, 2019
@avalluri avalluri requested a review from okartau October 31, 2019 08:02
Dockerfile Outdated Show resolved Hide resolved
test/test.make Outdated
$(TEST_CMD) $(shell go list $(TEST_ARGS) | sed -e 's;$(IMPORT_PATH);.;')
$(TEST_CMD) $(shell $(GO) list $(TEST_ARGS) | grep -v pmem-device-manager | sed -e 's;$(IMPORT_PATH);.;')
RUN_DM_TESTS = \
$(TEST_CMD) ./pkg/pmem-device-manager -exec $(SUDO)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes "make test" depend on sudo, which is not currently a requirement and also shouldn't be. Developers should be able to build and test on a machine with a user account that isn't allowed to become root.

The other issue is that the test depends on /dev/loop-control, right?

Let me propose a different approach to running tests that need to be run in a certain environment:

  • in the test code, check those preconditions (are root, have /dev/loop-control): if not met, skip the test
  • integrate the running of unit tests into our E2E testing:
    • build test binaries with go test -c
    • copy test binaries into the first worker
    • run them there as root

This is more complicated, but it also gives us better test coverage, because then operations that may depend on the OS kernel of the cluster node actually get tested with that kernel. We can also do unit tests that depend on actual PMEM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes "make test" depend on sudo, which is not currently a requirement and also shouldn't be. Developers should be able to build and test on a machine with a user account that isn't allowed to become root.

To avoid this root user. I initially tried ruining these tests in a container (#420 (comment)).

The other issue is that the test depends on /dev/loop-control, right?

Let me propose a different approach to running tests that need to be run in a certain environment:

  • in the test code, check those preconditions (are root, have /dev/loop-control): if not met, skip the test

  • integrate the running of unit tests into our E2E testing:

    • build test binaries with go test -c
    • copy test binaries into the first worker
    • run them there as root

This is more complicated, but it also gives us better test coverage, because then operations that may depend on the OS kernel of the cluster node actually get tested with that kernel. We can also do unit tests that depend on actual PMEM.

Ok, Instead of running these(device manager) unit tests in full(3-node) test cluster, I would propose to run them in a single virtual machine node. I pushed changes to support this: 45f21a6

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This single node vm installs only dependencies needed for PMEM, such as LVM2, ndctl. No other packages like Docker, Kubernetes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Instead of running these(device manager) unit tests in full(3-node) test cluster, I would propose to run them in a single virtual machine node.

That's a massive change to some already pretty complex scripts. What's the advantage compared to just using the normal VM cluster?

We need to consider two different use cases:

  • a developer using this while making changes to the source code
  • the CI jobs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a massive change to some already pretty complex scripts.

No, it's not that massive change as it looks in the PR. Most of the changes are indentations. It's just replacing few bits in the right place in test/start-kubernetes.sh script.

Defined package level error codes. The implementations are supposed to return
these error codes. And also revised the error message formatting and wrapping
using go-1.13 fmt package.
Discussion on issue intel#413, raised a requirement that PMEM-CSI should handle the
DeleteVolume() call for volumes that the underlined device already deleted
manually by the admin.  So for such a volume we just ignore and return no error
in DeviceManager.DeleteVolume(). This also makes DeleteVolume() idempotent.
@pohly
Copy link
Contributor

pohly commented Nov 21, 2019

@avalluri: so what did you change to fix the failure? It's a bit hard to tell because of the rebase. Is it 8b22191 and thus #472?

@avalluri
Copy link
Contributor Author

avalluri commented Nov 21, 2019

@avalluri: so what did you change to fix the failure? It's a bit hard to tell because of the rebase. Is it 8b22191 and thus #472?

I am about to mention that :) Yes, This should fix.

@avalluri
Copy link
Contributor Author

I believe it is better to run this test inside the existing steps

If this what you feel right, we can still run using make run_dm_tests in CI, then it picks the first node in the default cluster to run device manager tets.

@avalluri
Copy link
Contributor Author

@pohly I took your change pohly@a9e7136 and squashed to ff07726. Thanks.

@avalluri
Copy link
Contributor Author

DestroyNamespace will first deactivate a namespace, then destroy it. Suppose the driver crashes or gets killed after the first and before the second step. We are left with an inactive namespace that shouldn't be ignored.

Interesting question. Ya, may be true. We cannot ignore that namespace in this case.

@avalluri
Copy link
Contributor Author

DestroyNamespace will first deactivate a namespace, then destroy it. Suppose the driver crashes or gets killed after the first and before the second step. We are left with an inactive namespace that shouldn't be ignored.

Interesting question. Ya, may be true. We cannot ignore that namespace in this case.

By the way, we already dealing with only active namespaces for ListDevices, this case affects this also.

@avalluri
Copy link
Contributor Author

There is something weird going on in CI, 1-16 Direct tests running for 2h 28m but no results yet. Somehow got stuck. I am not sure if the issue with the code or something else.

@pohly
Copy link
Contributor

pohly commented Nov 21, 2019

Interesting question. Ya, may be true. We cannot ignore that namespace in this case.

So what do you think of the solution I have here: 45f7c42#diff-b99dbbc6a60be19c9eacba35ad8a5526

By the way, we already dealing with only active namespaces for ListDevices, this case affects this also.

What do we need to fix for ListDevices? Should it report inactive, non-empty namespaces and currently ignores them?

@pohly
Copy link
Contributor

pohly commented Nov 21, 2019

There is something weird going on in CI, 1-16 Direct tests running for 2h 28m but no results yet. Somehow got stuck. I am not sure if the issue with the code or something else.

You could try running the E2E tests in that mode locally.

@avalluri
Copy link
Contributor Author

So what do you think of the solution I have here: 45f7c42#diff-b99dbbc6a60be19c9eacba35ad8a5526

Should work, I couldn't see any issues with that.

What do we need to fix for ListDevices? Should it report inactive, non-empty namespaces and currently ignores them?

Yes, it also should have that fix. Maybe we better fix this ndctl layer.

@@ -93,7 +93,7 @@ func (r *Region) ActiveNamespaces() []*Namespace {
return r.namespaces(true)
}

//AllNamespaces returns all namespaces in the region
//AllNamespaces returns all non-zero sized namespaces in the region
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here explaining why it ignores namespaces with zero size? I know its in the commit message, but that's not necessarily where someone will look when reading this description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if !onlyActive || C.ndctl_namespace_is_active(ndns) == true {
namespaces = append(namespaces, (*Namespace)(ndns))
ns := (*Namespace)(ndns)
if (onlyActive && ns.Active()) || ns.Size() != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this break the semantic of onlyActive?

If it is true, then inactive namespaces are returned if their size is > 0, which is not what the caller asked for, is it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, Fixed it now. Thanks.

@pohly
Copy link
Contributor

pohly commented Nov 21, 2019

@avalluri: Did you also make some more changes to the Jenkinsfile? It's no longer valid: https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/blue/organizations/jenkins/pmem-csi/detail/PR-420/44/pipeline

@avalluri avalluri force-pushed the devicemanager branch 2 times, most recently from 63ca968 to c3ef004 Compare November 22, 2019 04:07
…paces

While deleting a namespace ndctl might still keeep the namespace inactive
and with size zero. And also there might be possible that the unsuccessful
deletes via Region.DestroyNamespace() might left the namespace disabled but not
deleted, so we should not ignore them too.

This change, considers the size of the found namespace and considers only if
it's non-zero.
Tests for both device managers: direct and LVM.

LVM tests used file backed loop device to mimic an logical volume group, This
requires /dev/loop-control file on host where the tests are run.

Intentionally disable running these device manager unit tests as it requires sudo/root user
permission.
Device manager tests need root/sudo user permission, and loop control device on
host where they run. So to avoid this we run these tests under a new virtual
machine.

Added new make target 'run_dm_tests' which sets up pmem-csi-${CLUSTER}-worker1
node if not exists and runs the device manager unit tests in it.

To make use of existing test cluster scripts, made needed changes to them to
support creating a single pmem node.
${env.BUILD_IMAGE} bash -c 'set -x; \
swupd bundle-add openssh-client && \
make run_dm_tests; \
make stop'"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step is now treated as success even if make run_dm_tests fails. As I intend to remove this step in #473, we can merge this PR without fixing this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants