DAOS-7177 control: report nvme result from storage prepare #5350
base: master
Conversation
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/1/execution/node/75/log
LGTM. No errors found by checkpatch.
Small comments. One thing wasn't clear to me: hadn't the initial PR this was based on already landed? Maybe I'm just confused, but it looked like some of those commits were repeated in the diff even though you changed the base.
Force-pushed from 057f198 to 54f2de8
LGTM. No errors found by checkpatch.
Force-pushed from 54f2de8 to bf71fcf
LGTM. No errors found by checkpatch.
Right, I had to merge with master after the parent branch landed. It ended up as quite a big change, but the result is more reliable: the results now differentiate between independently failing NVMe and SCM operations on the same host during storage prepare. See the pretty printer tests for examples.
Test stage Build on CentOS 7 debug completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/459/log
Test stage Build on Ubuntu 20.04 with Clang completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/396/log
Test stage Build on CentOS 7 release completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/454/log
Test stage Build on CentOS 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/465/log
Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/342/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/358/log
Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/4/execution/node/466/log
Force-pushed from bf71fcf to 06e4c0c
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/5/execution/node/80/log
Force-pushed from 06e4c0c to 6705864
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/6/execution/node/80/log
The checkpatch failure identified here: https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-5350/6/pipeline/43#step-80-log-48 concerns a spelling mistake in the vendored dependency "hashstructure" and should not be fixed in this PR.
I initially thought about using the request flags to filter which components of the response to print, but I don't want to rely on the correctness of the flags sent to the server: if for some reason the server prepares SCM, I want to know about it, regardless of what was passed in the request. Yes, the argument can be made that we should be able to assume correct behaviour and that the server will only ever perform what it is asked to do, but I would prefer to display the results that are actually returned rather than mask them based on what was requested.

In the specific case of prepare and SCM/NVMe results: if we have multiple hosts and on one of them SCM succeeds but NVMe fails (whereas on the rest both succeed), we get a host failure that can be printed in a separate table. But to correlate the NVMe failure with the host storage part of the response, we need to return a differentiated value for NvmeDevices (otherwise the aggregated results will not reflect the failure), hence the need to distinguish between a nil and an empty slice when hashing.
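To make that concrete, here is a minimal standalone sketch of the hashing problem. The types are hypothetical stand-ins for the control package's, and it assumes hashstructure v1's Hash(v, opts) signature; per the behaviour described above, the failure signal only survives hashing once the field is a pointer:

```go
package main

import (
	"fmt"

	"github.com/mitchellh/hashstructure"
)

// devList stands in for storage.NvmeControllers.
type devList []string

// Slice-valued field: per the discussion above, hashstructure reportedly
// hashes a nil slice (failed NVMe prepare) and an empty slice (successful
// prepare, zero devices) to the same digest.
type hostStorageSlice struct {
	NvmeDevices devList
}

// Pointer-valued field: a nil pointer (failure) can be told apart from a
// pointer to an empty slice (success), yielding distinct map keys.
type hostStoragePtr struct {
	NvmeDevices *devList
}

func main() {
	s1, _ := hashstructure.Hash(hostStorageSlice{NvmeDevices: nil}, nil)
	s2, _ := hashstructure.Hash(hostStorageSlice{NvmeDevices: devList{}}, nil)
	fmt.Println("slice keys collide:", s1 == s2) // per the thread: true

	empty := devList{}
	p1, _ := hashstructure.Hash(hostStoragePtr{NvmeDevices: nil}, nil)
	p2, _ := hashstructure.Hash(hostStoragePtr{NvmeDevices: &empty}, nil)
	fmt.Println("pointer keys collide:", p1 == p2) // expected: false
}
```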
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/7/execution/node/80/log
@mjmac The crux of the issue that this PR attempts to resolve is that when building the HostStorageMap, which represents groupings of hosts that have responded with identical results, there is no distinction between a host that failed NVMe prepare and one that didn't. For example, even if we run separate RPCs for NVMe, we still need to combine the response results and add them to the HostStorageMap:
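For illustration only (all names here are simplified placeholders, not the real control API), the combination step might look like this:

```go
package main

import "fmt"

// hostStorage is a simplified placeholder for the control package's type:
// a nil slice records a failed prepare, a non-nil slice a successful one.
type hostStorage struct {
	ScmNamespaces []string
	NvmeDevices   []string
}

// combine merges the outcomes of separate SCM and NVMe prepare RPCs for
// one host into the single value that gets keyed into the HostStorageMap.
func combine(scmErr, nvmeErr error) hostStorage {
	hs := hostStorage{}
	if scmErr == nil {
		hs.ScmNamespaces = []string{} // prepared, possibly zero namespaces
	}
	if nvmeErr == nil {
		hs.NvmeDevices = []string{} // prepared, possibly zero devices
	}
	return hs
}

func main() {
	good := combine(nil, nil)                       // both succeeded
	bad := combine(nil, fmt.Errorf("nvme prepare")) // NVMe failed
	fmt.Printf("good=%+v\nbad=%+v\n", good, bad)
	// The two values differ only by nil vs empty NvmeDevices, which is
	// exactly the distinction the hash key has to preserve.
}
```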
In this scenario, regardless of how you combine the results, both HostStorage structs added to the map will hash to the same key value, because no hashable HostStorage field is altered to communicate the NVMe failure. This can be resolved in a number of ways:

Which route would you like me to follow? I don't personally see a particular disadvantage (compared to the others) in the currently implemented solution.
Force-pushed from f3c5d88 to 508ee4d
…trol-scan-nvme-summary
LGTM. No errors found by checkpatch.
Test stage Unit Test completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/10/execution/node/910/log
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/11/execution/node/107/log
Update dmg storage prepare command output to display results of SCM and NVMe prepare operations in a way that will illustrate which actions were performed in the cases that either one or the other failed, and if only one or the other was requested. If a particular host fails NVMe prepare but succeeds SCM prepare, for example, a host error will be displayed for the NVMe failure and the coalesced result table will contain a separate entry for the failed host indicating SCM success but no NVMe result.

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Force-pushed from 508ee4d to c9878b8
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/12/execution/node/105/log
Test stage Unit Test completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/12/execution/node/863/log
It seems basically sane to me, but one thought I have is: how stable should we expect the control API to be at this point? Maybe it would be worthwhile to create a different prepare result map that could include errors for individual SSD components. The fact that it's hard to shoehorn everything we need into the HostStorageMap without some subtlety like this nil vs. empty distinction could indicate the use case is better served by a new type.

If you fix the unit test failures, I could approve as-is, but it might be better to step back and see if creating a new type reduces or removes the problems we've gone back and forth on.
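A rough sketch of what such a dedicated type might look like (entirely hypothetical names, not part of the control API); using comparable struct fields would sidestep the hashing subtlety altogether:

```go
// Package name reused for illustration only; nothing here exists upstream.
package control

// HostPrepareResult records each subsystem's outcome explicitly instead
// of encoding failure as a nil device slice. Hypothetical sketch only.
type HostPrepareResult struct {
	ScmErr  string `json:"scm_err"`  // "" means SCM prepare succeeded
	NvmeErr string `json:"nvme_err"` // "" means NVMe prepare succeeded
}

// HostPrepareResultMap groups host addresses by identical prepare
// outcomes. Because the struct is comparable, it can key the map
// directly, so a partial failure can never collapse into success.
type HostPrepareResultMap map[HostPrepareResult][]string

// Add records a host under its outcome key.
func (m HostPrepareResultMap) Add(host string, res HostPrepareResult) {
	m[res] = append(m[res], host)
}
```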
```diff
@@ -34,19 +34,19 @@ var storageHashOpts = hashstructure.HashOptions{
 type HostStorage struct {
 	// NvmeDevices contains the set of NVMe controllers (SSDs)
 	// in this configuration.
-	NvmeDevices storage.NvmeControllers `json:"nvme_devices"`
+	NvmeDevices *storage.NvmeControllers `json:"nvme_devices"`
```
These types are slices, right? Isn't it already possible for slices to be nil?
Yes, but hashstructure doesn't distinguish between a typed nil slice and an empty slice when generating a hash digest. So, in order to generate distinct keys in the HSM, I needed to change the field to a pointer; a nil pointer can then be distinguished from a pointer to an empty slice (for failure and success respectively), each generating a distinct key for the map when hashed.
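For reference, Go itself does distinguish a nil slice from an empty one; it is only the hashing library that conflates them. A quick illustration:

```go
package main

import (
	"fmt"
	"reflect"
)

func main() {
	var nilSlice []string    // typed nil slice, e.g. NVMe prepare failed
	emptySlice := []string{} // non-nil but empty, e.g. prepare succeeded

	fmt.Println(nilSlice == nil, emptySlice == nil)      // true false
	fmt.Println(len(nilSlice), len(emptySlice))          // 0 0
	fmt.Println(reflect.DeepEqual(nilSlice, emptySlice)) // false
}
```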
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the crux of the issue I am having with this PR. Because a third-party library doesn't behave the way we want it to in order to handle this narrow use case, we're making changes throughout our codebase? What if the next version fixes it, or otherwise changes behavior? Will we have more ripples throughout our code to accommodate that?
Given that the code is open source under a permissive license, I think that we could fork it, fix it, use it and submit the fix upstream. I suspect that they would accept a contribution to fix this because after studying the code I think it's an oversight rather than a deliberate design decision.
Alternatively, we could use this as an opportunity to rethink what we're doing here. I'm still not convinced that we need to perform remote NVMe prepare/reset. The server already does an automatic prepare on startup. This seems to me like we're investing a lot of time and energy on solving a problem with limited application.
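If the fork route were taken, Go modules make it cheap to point at a patched copy while the fix is upstreamed. An illustrative go.mod replace directive (the fork module path and version are hypothetical):

```
// go.mod (illustrative): swap the dependency for a patched fork until the
// upstream fix lands. The fork path below does not actually exist.
replace github.com/mitchellh/hashstructure => github.com/daos-stack/hashstructure v1.1.0
```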
I have had a fair amount of discussion with @mjmac on the topic and he is considering his preferred approach.
…trol-dmg-prep-results
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/13/execution/node/123/log
LGTM. No errors found by checkpatch.
…trol-dmg-prep-results
LGTM. No errors found by checkpatch.
Test stage Unit Test completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-5350/15/execution/node/889/log
…trol-dmg-prep-results
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
LGTM. No errors found by checkpatch.
Update dmg storage prepare command output to display results of SCM and
NVMe prepare operations in a way that will illustrate which actions were
performed in the cases that either one or the other failed and if only
one or the other was requested. If a particular host fails NVMe prepare
but succeeds SCM prepare for example, a host error will be displayed for
the NVMe failure and the coalesced result table will contain a separate
entry for the failed host indicating SCM success but no NVMe result.
Update hashstructure package to v2 in go.mod and update vendor directory.
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
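For context on that last change, hashstructure v2 lives at a new module path and adds an explicit format argument to Hash. A minimal usage sketch, assuming the v2 API as published (the hostStorage type is a placeholder):

```go
package main

import (
	"fmt"

	hashstructure "github.com/mitchellh/hashstructure/v2"
)

func main() {
	type hostStorage struct {
		NvmeDevices *[]string // pointer field, as in the diff above
	}
	// v2 requires a format argument: FormatV2 is the current format,
	// while FormatV1 reproduces the old digests if stability matters.
	key, err := hashstructure.Hash(hostStorage{}, hashstructure.FormatV2, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(key)
}
```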