
Misc NDM Improvement #55

Merged
merged 18 commits into from
Jul 3, 2023

Conversation

Vicente-Cheng (Collaborator) commented Mar 27, 2023

Problem:

  1. We found that NDM removes a hot-plugged device from the blockdevices CRs, which causes data loss when the device is added back.
  2. We also want to allow the user to run a repair tool instead of forcing another format.
  3. Miscellaneous improvements.

Solution:

  1. Change the state of a hot-plugged device to inactive instead of deleting it from the blockdevices CR.
  2. Add a new field `corrupted`. For a corrupted device, the user can choose to force format it or try to repair it with `e2fsck`.
  3. Improve logging.
  4. Reduce redundant CR updates.
  5. Remove the polling mechanism (but keep the two handler functions for now).
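
The hot-plug handling in points 1 and 2 can be sketched as below. This is a minimal illustration, not NDM's actual code: the type names, field names, and handler functions here are simplified stand-ins for the real `blockdevices.harvesterhci.io` CR types and wrangler handlers.

```go
package main

import "fmt"

// Simplified stand-ins for the blockdevice CR status; the real types live
// in NDM's pkg/apis and are reconciled through controller handlers.
type FilesystemStatus struct {
	Corrupted bool
}

type BlockDeviceStatus struct {
	State      string // "Active" or "Inactive"
	Filesystem FilesystemStatus
}

type BlockDevice struct {
	Name   string
	Status BlockDeviceStatus
}

// onDeviceRemoved handles a udev "remove" event. Instead of deleting the
// CR (which would lose the device's recorded configuration and make the
// data unusable when the disk comes back), the device is only marked
// Inactive.
func onDeviceRemoved(dev *BlockDevice) {
	dev.Status.State = "Inactive"
}

// onDeviceAdded handles a udev "add" event for a device that already has
// a CR: it flips the state back to Active, so the previous data and
// configuration survive the unplug/replug cycle.
func onDeviceAdded(dev *BlockDevice) {
	dev.Status.State = "Active"
}

func main() {
	dev := BlockDevice{Name: "sda", Status: BlockDeviceStatus{State: "Active"}}
	onDeviceRemoved(&dev)
	fmt.Println(dev.Status.State) // Inactive
	onDeviceAdded(&dev)
	fmt.Println(dev.Status.State) // Active
}
```

Marking the CR inactive rather than deleting it is what keeps the add/remove cycle lossless in test case 1 below.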

Related Issue:
harvester/harvester#3218
harvester/harvester#3412
harvester/harvester#4016

Test plan:
case 1:

  1. Add a single disk (it should be external storage so it can be hot-plugged).
  2. Remove the disk.
  3. Check that the blockdevice status is inactive.
  4. Add the disk back.
  5. Check that the blockdevice status is active and the disk works again.

case 2:
NOTE 1: Update the blockdevices CRD (from this PR) before running this test.
NOTE 2: Make sure you use a SCSI disk with a WWN, or the filesystem UUID will be erased when we break the filesystem.

  1. Add a single disk.
  2. Create external storage with the above disk.
  3. Remove this storage.
  4. Corrupt the filesystem (e.g. run `dd if=/dev/zero of=/dev/sda bs=512k count=1 oflag=sync`).
  5. Try to create external storage with this disk again.
  6. You should see a mount error, and `Status.Filesystem.Corrupted` should be `true`.
  7. Check on the UI that you can force format the disk again.
  8. Add it back with force format; it should work again.
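
The corruption-detection step (6) can be sketched as follows. Again this is a simplified illustration: `mountDevice` and `provision` are hypothetical helpers standing in for NDM's real mount/provision path, which shells out to the node's mount and filesystem tools.

```go
package main

import (
	"errors"
	"fmt"
)

type FilesystemStatus struct {
	Corrupted bool
}

// mountDevice is a stand-in for the real mount call; here it simply
// fails for a device we pretend has a broken superblock.
func mountDevice(dev string, broken bool) error {
	if broken {
		return errors.New("mount: wrong fs type, bad option, bad superblock")
	}
	return nil
}

// provision tries to mount the device. On a mount error it records
// Status.Filesystem.Corrupted = true, so the UI can offer the user a
// choice between force format and an e2fsck repair instead of silently
// failing.
func provision(dev string, broken bool, fs *FilesystemStatus) error {
	if err := mountDevice(dev, broken); err != nil {
		fs.Corrupted = true
		return err
	}
	fs.Corrupted = false
	return nil
}

func main() {
	var fs FilesystemStatus
	err := provision("/dev/sda", true, &fs)
	fmt.Println(err != nil, fs.Corrupted) // true true
}
```

The point of the flag is that a mount failure no longer dead-ends: it is surfaced on the CR so a later action (format or repair) can react to it.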

@Vicente-Cheng Vicente-Cheng force-pushed the wip-NDM-improvement branch 12 times, most recently from f45112c to 7311c01 Compare March 28, 2023 11:20
@Vicente-Cheng Vicente-Cheng changed the title Wip ndm improvement Misc NDM Improvement Mar 28, 2023
@Vicente-Cheng Vicente-Cheng force-pushed the wip-NDM-improvement branch 3 times, most recently from 7356be2 to cd3680d Compare March 28, 2023 15:47
@Vicente-Cheng Vicente-Cheng marked this pull request as ready for review April 6, 2023 03:36
@Vicente-Cheng Vicente-Cheng force-pushed the wip-NDM-improvement branch 8 times, most recently from b267e2f to 4f9b796 Compare April 7, 2023 07:21
@Vicente-Cheng Vicente-Cheng force-pushed the wip-NDM-improvement branch 7 times, most recently from 67c0b8e to 3136f34 Compare July 3, 2023 06:01
bk201 (Member) commented Jul 3, 2023

Update
False alarm; the testing disks had duplicate WWNs. Please ignore the comment below.

I tried attaching several disks within a very short time, but only some blockdevices were created:

for ((i=0;i<10;i++)); do ./attach-disk.sh node2 ; done

Block devices (please ignore node3)

NAMESPACE         NAME                               TYPE   DEVPATH    MOUNTPOINT   NODENAME   STATUS   AGE
longhorn-system   5819b466585e1e3b1c9ed398918a9cb1   disk   /dev/sdi                node2      Active   5m4s
longhorn-system   8a655634522b710cc4f2d3514f10b5c3   disk   /dev/sdc                node3      Active   10m
longhorn-system   91c3333ea91016f699f1641ba3228354   disk   /dev/sda                node2      Active   5m11s
longhorn-system   a7f262a2e9610974dd9904ee577f7515   disk   /dev/sdf                node2      Active   5m8s
longhorn-system   b3d51f437d5c71e7476929625234549e   disk   /dev/sdd                node3      Active   10m
longhorn-system   d390352fbf27678e80eed80342918336   disk   /dev/sde                node3      Active   10m

Virtual disks:

node2:~ # ls -alh /dev/sd*
brw-rw---- 1 root disk 8,   0 Jul  3 06:38 /dev/sda
brw-rw---- 1 root disk 8,  16 Jul  3 06:24 /dev/sdb
brw-rw---- 1 root disk 8,  32 Jul  3 06:38 /dev/sdc
brw-rw---- 1 root disk 8,  48 Jul  3 06:38 /dev/sdd
brw-rw---- 1 root disk 8,  64 Jul  3 06:38 /dev/sde
brw-rw---- 1 root disk 8,  80 Jul  3 06:38 /dev/sdf
brw-rw---- 1 root disk 8,  96 Jul  3 06:38 /dev/sdg
brw-rw---- 1 root disk 8, 112 Jul  3 06:38 /dev/sdh
brw-rw---- 1 root disk 8, 128 Jul  3 06:38 /dev/sdi
brw-rw---- 1 root disk 8, 144 Jul  3 06:38 /dev/sdj
brw-rw---- 1 root disk 8, 160 Jul  3 06:38 /dev/sdk

NDM log:
harvester-node-disk-manager-4v85m.log

    - That gives us a chance to run e2fsck
    - Also change the DevPath to the current host path

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - now we are triggered by uevent
    - minor log improvement

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - SingleDisk: autoprovision, unprovision, provision
    - Hotplug: add, remove

    Some minor changes including:
    - run ci-integration test directly
    - improve debug information when tests fail

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
  - golang to v1.19
  - wrangler to v1.1.1
  - client-go to v0.24.13
  - longhorn to v1.4.2, also bump to v1beta2 api
  - golangci-lint to v1.52.0

  Also, minor fixes for CI complaints

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - That would be helpful for debugging
    - Add more check for integration tests
    - Using longhorn v1.4.2

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
  - Wrap the cond lock for callers
  - Use a blocking channel to wait for the scanner to stop
  - Improve the log messages for uevents

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
  - now some functions are replaced by the standard library `os`

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    We need to check `DiskAddedToNode` against the provisioned device
    because the device update may fail. In that case the device status
    is not `Provisioned`, but the Longhorn node already has the disk,
    and we would otherwise never perform the follow-up update that
    marks the device `Provisioned`.

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
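
The check this commit describes can be sketched as below. This is an illustrative simplification: `diskAddedToNode` and `needProvisionUpdate` are hypothetical helpers; the real code inspects the Longhorn Node CR rather than a plain map.

```go
package main

import "fmt"

// diskAddedToNode is a stand-in for querying the Longhorn node spec for
// the disk; the real check looks at the Longhorn Node CR's disk list.
func diskAddedToNode(lhNodeDisks map[string]bool, name string) bool {
	return lhNodeDisks[name]
}

// needProvisionUpdate reports whether the device still has to be marked
// Provisioned. An earlier status update may have failed, leaving the
// Longhorn node with the disk while the blockdevice CR still says it is
// unprovisioned, so we must not rely on the CR status alone.
func needProvisionUpdate(status string, lhNodeDisks map[string]bool, name string) bool {
	if status == "Provisioned" {
		return false // nothing left to do
	}
	// CR says unprovisioned, but Longhorn already has the disk: the
	// previous update failed, so retry until the CR reaches Provisioned.
	return diskAddedToNode(lhNodeDisks, name)
}

func main() {
	disks := map[string]bool{"disk-1": true}
	fmt.Println(needProvisionUpdate("Unprovisioned", disks, "disk-1")) // true
	fmt.Println(needProvisionUpdate("Provisioned", disks, "disk-1"))   // false
}
```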
    - Add an XML sample to let users add a disk w/ WWN

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - change json format to yaml
    - skip G204 (gosec)
    - fix other lint errors

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - Add an upgrade test to CI. The detailed steps are as follows:

      1. Install the latest released version
      2. Add a disk
      3. Upgrade to the developing image
      4. Run the integration tests

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
bk201 (Member) left a comment
LGTM
