
DAOS-7177 control: report nvme result from storage prepare #5350

Open · wants to merge 10 commits into master from tanabarr/control-dmg-prep-results

Conversation


@tanabarr tanabarr commented Apr 6, 2021

Update dmg storage prepare command output to display the results of SCM and
NVMe prepare operations in a way that illustrates which actions were
performed when either one fails, or when only one was requested.
For example, if a particular host fails NVMe prepare but succeeds SCM
prepare, a host error will be displayed for the NVMe failure and the
coalesced result table will contain a separate entry for the failed host
indicating SCM success but no NVMe result.

Update hashstructure package to v2 in go.mod and update vendor directory.

Signed-off-by: Tom Nabarro tom.nabarro@intel.com

@tanabarr tanabarr requested review from mjmac and kjacque April 6, 2021 22:47

Base automatically changed from tanabarr/control-clean-hugepages to master April 7, 2021 15:16

@daosbuild1 daosbuild1 left a comment


LGTM. No errors found by checkpatch.


@kjacque kjacque left a comment


Small comments. One thing wasn't clear to me: hadn't the initial PR this was based on already landed? Maybe I'm just confused, but it looked like some of those commits were repeated in the diff even though you changed the base.

src/control/cmd/dmg/pretty/storage_test.go (outdated, resolved)
src/control/server/ctl_storage_rpc.go (outdated, resolved)
src/control/server/ctl_storage_rpc.go (outdated, resolved)
@tanabarr tanabarr force-pushed the tanabarr/control-dmg-prep-results branch from 057f198 to 54f2de8 (April 8, 2021 22:35)

@daosbuild1 daosbuild1 left a comment


LGTM. No errors found by checkpatch.

@tanabarr tanabarr force-pushed the tanabarr/control-dmg-prep-results branch from 54f2de8 to bf71fcf (April 8, 2021 22:38)

@daosbuild1 daosbuild1 left a comment


LGTM. No errors found by checkpatch.

@tanabarr

tanabarr commented Apr 8, 2021

Small comments. One thing wasn't clear to me: hadn't the initial PR this was based on already landed? Maybe I'm just confused, but it looked like some of those commits were repeated in the diff even though you changed the base.

Right, I had to merge with master after the parent branch landed. It ended up as quite a big change, but the result is more reliable, and the results differentiate between independently failing NVMe and SCM operations on the same host during storage prepare; see the pretty printer tests for examples.

@daosbuild1

Test stage Build on CentOS 7 debug completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/459/log

@daosbuild1

Test stage Build on Ubuntu 20.04 with Clang completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/396/log

@daosbuild1

Test stage Build on CentOS 7 release completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/454/log

@daosbuild1

Test stage Build on CentOS 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/465/log

@tanabarr tanabarr requested a review from kjacque April 8, 2021 22:45
@daosbuild1

Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/342/log

@daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/358/log

@daosbuild1

Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/466/log

@tanabarr tanabarr force-pushed the tanabarr/control-dmg-prep-results branch from bf71fcf to 06e4c0c (April 8, 2021 22:54)

kjacque previously approved these changes Apr 9, 2021

@tanabarr tanabarr requested a review from kjacque April 9, 2021 10:32
@tanabarr

tanabarr commented Apr 9, 2021

The checkpatch failure identified here: https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-5350/6/pipeline/43#step-80-log-48 concerns a spelling mistake in the vendored dependency "hashstructure" and should not be fixed in this PR.

@tanabarr

tanabarr commented Apr 9, 2021

I initially thought about using the request flags to filter which components of the response to print, but I don't want to rely on the correctness of the flags sent to the server (i.e. if for some reason the server prepares SCM, I want to know about it regardless of what was passed in the request). Yes, the argument can be made that we should be able to assume the behaviour will be correct and that the server will only ever perform what it is requested to do, but I would prefer to display the results that are returned rather than mask them based on what was requested.

In the specific case of prepare and SCM/NVMe results: if we have multiple hosts and on one of those hosts SCM succeeds but NVMe fails (whereas on the rest of the hosts both succeed), we will have a host failure that can be printed in a separate table. But to correlate the NVMe failure with the host storage part of the response, we need to return a differentiated value for NvmeDevices (otherwise the aggregated results will not reflect the failure), hence needing the ability to differentiate between a nil and an empty slice when hashing.


@tanabarr

tanabarr commented Apr 19, 2021

@mjmac I don't think the proposal to ...make the "prepare SCM" and "prepare NVMe" operations distinct at the API/RPC level will resolve the relevant issue.

The crux of the issue that this PR attempts to resolve is that when building the HostStorageMap that represents groupings of hosts that have responded with identical results, there is no distinction between a host that failed NVMe prepare and one that didn't.

For example, even if we run separate RPCs for NVMe, we still need to combine the response results and add them to the HostStorageMap:

  • host1 returns HS with ScmNamespaces: S1 after prepare SCM success
  • host1 then returns empty HS from prepare NVMe after success
  • host2 returns HS with ScmNamespaces: S1 after prepare SCM success
  • host2 then returns empty HS from prepare NVMe after failure with an entry in the HostErrorsMap (HEM)

In this scenario, regardless of how you combine the results, both HostStorage structs to be added to the Map will hash to the same key value because no hashable HostStorage field is altered to communicate the NVMe failure.

This can be resolved in a number of ways:

  • Existing solution, where HostStorage fields are changed to references (to slice aliases) and NVMe prepare failure can be represented by nil and success by an empty slice (the difference between which results in distinct hash digests).
  • Postprocessing HostStorageMap to move entries correlating to HostErrorsMap entries into a distinct group for NVMe failures (IMO this would look very ugly in the control API).
  • Similar to the above, but performing the postprocessing in the pretty printer: alter the display of results by extracting entries from HostStorageMap and correlating them with HostErrorsMap to reflect which hosts have failed NVMe prepare; this will require typing/recognition of NVMe-specific errors (the ugliness would be confined to the DMG layer).
  • Add another member to the HostStorage struct which indicates the NVMe prepare result explicitly (similar to "reboot required"); probably the most "correct" way to do it in a number of senses, but we would be adding a storage-prepare-specific field to the generic HostStorage type.

Which route would you like me to follow? I don't personally see a particular disadvantage (in comparison to the others) with regard to the currently implemented solution.


@daosbuild1 daosbuild1 left a comment


LGTM. No errors found by checkpatch.


@tanabarr tanabarr changed the base branch from master to tanabarr/control-scan-nvme-summary April 20, 2021 07:55

@tanabarr tanabarr force-pushed the tanabarr/control-dmg-prep-results branch from 508ee4d to c9878b8 (April 20, 2021 08:03)
@tanabarr tanabarr requested review from mjmac and kjacque April 20, 2021 08:04



@kjacque kjacque left a comment


It seems basically sane to me, but one thought I have is: how stable should we expect the control API to be at this point? Maybe it would be worthwhile to create a different prepare result map that could include errors for individual SSD components. The fact that it's hard to shoehorn all we need into the HostStorageMap without some subtlety like this, nil vs. empty, could indicate the use case is better served by a new type.

If you fix the unit test failures, I could approve as-is, but it might be better to step back and see if creating a new type reduces or removes the problems we've gone back and forth on.

@@ -34,19 +34,19 @@ var storageHashOpts = hashstructure.HashOptions{
 type HostStorage struct {
 	// NvmeDevices contains the set of NVMe controllers (SSDs)
 	// in this configuration.
-	NvmeDevices storage.NvmeControllers `json:"nvme_devices"`
+	NvmeDevices *storage.NvmeControllers `json:"nvme_devices"`
Contributor

These types are slices, right? Isn't it already possible for slices to be nil?

Contributor Author

Yes, but hashstructure doesn't distinguish between a typed nil slice and an empty slice when generating a hash digest. So, in order to generate distinct keys in the HostStorageMap, I needed to change the field to a pointer: a nil pointer can then be distinguished from a pointer to an empty slice (for failure and success respectively), and each generates a distinct key for the map when hashed.

Contributor

This is the crux of the issue I am having with this PR. Because a third-party library doesn't behave the way we want it to in order to handle this narrow use case, we're making changes all throughout our codebase? What if the next version fixes it, or otherwise changes behavior? Will we have more ripples throughout our code to accommodate that?

Given that the code is open source under a permissive license, I think that we could fork it, fix it, use it and submit the fix upstream. I suspect that they would accept a contribution to fix this because after studying the code I think it's an oversight rather than a deliberate design decision.

Alternatively, we could use this as an opportunity to rethink what we're doing here. I'm still not convinced that we need to perform remote NVMe prepare/reset. The server already does an automatic prepare on startup. This seems to me like we're investing a lot of time and energy on solving a problem with limited application.

@tanabarr

It seems basically sane to me, but one thought I have is: how stable should we expect the control API to be at this point? Maybe it would be worthwhile to create a different prepare result map that could include errors for individual SSD components. The fact that it's hard to shoehorn all we need into the HostStorageMap without some subtlety like this, nil vs. empty, could indicate the use case is better served by a new type.

If you fix the unit test failures, I could approve as-is, but it might be better to step back and see if creating a new type reduces or removes the problems we've gone back and forth on.

I've had a fair amount of discussion with @mjmac on the topic and he is considering his preferred approach.


Base automatically changed from tanabarr/control-scan-nvme-summary to master April 22, 2021 19:45

@daosbuild1 daosbuild1 left a comment


LGTM. No errors found by checkpatch.


@daosbuild1 daosbuild1 left a comment


LGTM. No errors found by checkpatch.



@daosbuild1 daosbuild1 left a comment


LGTM. No errors found by checkpatch.

@tanabarr tanabarr requested a review from a team as a code owner February 22, 2022 21:50
@tanabarr tanabarr requested a review from a team as a code owner July 24, 2022 21:17
@kccain kccain removed the request for review from a team September 6, 2022 19:14