
Misc NDM Improvement #55

Merged
merged 18 commits into from
Jul 3, 2023

Conversation

Vicente-Cheng (Collaborator) commented Mar 27, 2023

Problem:

  1. We found that NDM removes a hot-plugged device from the blockdevices CRs, which causes data loss when the device is added back.
  2. We also want to allow the user to run a repair tool instead of forcing another format.
  3. Miscellaneous improvements.

Solution:

  1. Change the state of a hot-plugged device to inactive instead of deleting it from the blockdevices CR.
  2. Add a new field `corrupted`. For a corrupted device, the user can choose to force format it or try to repair it with `e2fsck`.
  3. Improve logging.
  4. Reduce redundant CR updates.
  5. Remove the polling mechanism (but keep the two handler functions for now).
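
The hot-plug handling in points 1 and 2 can be sketched as below. This is a minimal illustration, not NDM's actual code: the type names, field names, and handler functions here are simplified stand-ins for the real `blockdevices.harvesterhci.io` CR types and wrangler handlers.

```go
package main

import "fmt"

// Simplified stand-ins for the blockdevice CR status; the real types live
// in NDM's pkg/apis and are reconciled through controller handlers.
type FilesystemStatus struct {
	Corrupted bool
}

type BlockDeviceStatus struct {
	State      string // "Active" or "Inactive"
	Filesystem FilesystemStatus
}

type BlockDevice struct {
	Name   string
	Status BlockDeviceStatus
}

// onDeviceRemoved handles a udev "remove" event. Instead of deleting the
// CR (which would lose the device's recorded configuration and make the
// data unusable when the disk comes back), the device is only marked
// Inactive.
func onDeviceRemoved(dev *BlockDevice) {
	dev.Status.State = "Inactive"
}

// onDeviceAdded handles a udev "add" event for a device that already has
// a CR: it flips the state back to Active, so the previous data and
// configuration survive the unplug/replug cycle.
func onDeviceAdded(dev *BlockDevice) {
	dev.Status.State = "Active"
}

func main() {
	dev := BlockDevice{Name: "sda", Status: BlockDeviceStatus{State: "Active"}}
	onDeviceRemoved(&dev)
	fmt.Println(dev.Status.State) // Inactive
	onDeviceAdded(&dev)
	fmt.Println(dev.Status.State) // Active
}
```

Marking the CR inactive rather than deleting it is what keeps the add/remove cycle lossless in test case 1 below.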

Related Issue:
harvester/harvester#3218
harvester/harvester#3412
harvester/harvester#4016

Test plan:
case 1:

  1. Add a single disk (it should be external storage so it can be hot-plugged).
  2. Remove the disk.
  3. Check that the blockdevice status is inactive.
  4. Add the disk back.
  5. Check that the blockdevice status is active and the disk works again.

case 2:
NOTE 1: Update the blockdevices CRD (from this PR) before running this test.
NOTE 2: Make sure you use a SCSI disk with a WWN, or the filesystem UUID will be erased when we break the filesystem.

  1. Add a single disk.
  2. Create external storage with the above disk.
  3. Remove this storage.
  4. Corrupt the filesystem (e.g. run `dd if=/dev/zero of=/dev/sda bs=512k count=1 oflag=sync`).
  5. Try to create external storage with this disk again.
  6. You should see a mount error, and `Status.Filesystem.Corrupted` should be `true`.
  7. Check on the UI that you can force format the disk again.
  8. Add it back with force format; it should work again.
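
The corruption-detection step (6) can be sketched as follows. Again this is a simplified illustration: `mountDevice` and `provision` are hypothetical helpers standing in for NDM's real mount/provision path, which shells out to the node's mount and filesystem tools.

```go
package main

import (
	"errors"
	"fmt"
)

type FilesystemStatus struct {
	Corrupted bool
}

// mountDevice is a stand-in for the real mount call; here it simply
// fails for a device we pretend has a broken superblock.
func mountDevice(dev string, broken bool) error {
	if broken {
		return errors.New("mount: wrong fs type, bad option, bad superblock")
	}
	return nil
}

// provision tries to mount the device. On a mount error it records
// Status.Filesystem.Corrupted = true, so the UI can offer the user a
// choice between force format and an e2fsck repair instead of silently
// failing.
func provision(dev string, broken bool, fs *FilesystemStatus) error {
	if err := mountDevice(dev, broken); err != nil {
		fs.Corrupted = true
		return err
	}
	fs.Corrupted = false
	return nil
}

func main() {
	var fs FilesystemStatus
	err := provision("/dev/sda", true, &fs)
	fmt.Println(err != nil, fs.Corrupted) // true true
}
```

The point of the flag is that a mount failure no longer dead-ends: it is surfaced on the CR so a later action (format or repair) can react to it.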

@Vicente-Cheng Vicente-Cheng force-pushed the wip-NDM-improvement branch 12 times, most recently from f45112c to 7311c01 Compare March 28, 2023 11:20
@Vicente-Cheng Vicente-Cheng changed the title Wip ndm improvement Misc NDM Improvement Mar 28, 2023
@Vicente-Cheng Vicente-Cheng force-pushed the wip-NDM-improvement branch 3 times, most recently from 7356be2 to cd3680d Compare March 28, 2023 15:47
@Vicente-Cheng Vicente-Cheng marked this pull request as ready for review April 6, 2023 03:36
@Vicente-Cheng Vicente-Cheng force-pushed the wip-NDM-improvement branch 8 times, most recently from b267e2f to 4f9b796 Compare April 7, 2023 07:21
@Vicente-Cheng Vicente-Cheng force-pushed the wip-NDM-improvement branch 7 times, most recently from 67c0b8e to 3136f34 Compare July 3, 2023 06:01
bk201 (Member) commented Jul 3, 2023

Update
False alarm; the testing disks had duplicate WWNs. Please ignore the comment below.

I tried attaching several disks within a very short time, but only some blockdevices were created:

for ((i=0;i<10;i++)); do ./attach-disk.sh node2 ; done

Block devices (please ignore node3)

NAMESPACE         NAME                               TYPE   DEVPATH    MOUNTPOINT   NODENAME   STATUS   AGE
longhorn-system   5819b466585e1e3b1c9ed398918a9cb1   disk   /dev/sdi                node2      Active   5m4s
longhorn-system   8a655634522b710cc4f2d3514f10b5c3   disk   /dev/sdc                node3      Active   10m
longhorn-system   91c3333ea91016f699f1641ba3228354   disk   /dev/sda                node2      Active   5m11s
longhorn-system   a7f262a2e9610974dd9904ee577f7515   disk   /dev/sdf                node2      Active   5m8s
longhorn-system   b3d51f437d5c71e7476929625234549e   disk   /dev/sdd                node3      Active   10m
longhorn-system   d390352fbf27678e80eed80342918336   disk   /dev/sde                node3      Active   10m

Virtual disks:

node2:~ # ls -alh /dev/sd*
brw-rw---- 1 root disk 8,   0 Jul  3 06:38 /dev/sda
brw-rw---- 1 root disk 8,  16 Jul  3 06:24 /dev/sdb
brw-rw---- 1 root disk 8,  32 Jul  3 06:38 /dev/sdc
brw-rw---- 1 root disk 8,  48 Jul  3 06:38 /dev/sdd
brw-rw---- 1 root disk 8,  64 Jul  3 06:38 /dev/sde
brw-rw---- 1 root disk 8,  80 Jul  3 06:38 /dev/sdf
brw-rw---- 1 root disk 8,  96 Jul  3 06:38 /dev/sdg
brw-rw---- 1 root disk 8, 112 Jul  3 06:38 /dev/sdh
brw-rw---- 1 root disk 8, 128 Jul  3 06:38 /dev/sdi
brw-rw---- 1 root disk 8, 144 Jul  3 06:38 /dev/sdj
brw-rw---- 1 root disk 8, 160 Jul  3 06:38 /dev/sdk

NDM log:
harvester-node-disk-manager-4v85m.log

    - That gives us a chance to run e2fsck
    - Also change the DevPath to the current host path

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - now we are triggered by uevent
    - minor log improvement

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - SingleDisk: autoprovision, unprovision, provision
    - Hotplug: add, remove

    Some minor changes including:
    - run ci-integration test directly
    - improve debug information when tests fail

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
  - golang to v1.19
  - wrangler to v1.1.1
  - client-go to v0.24.13
  - longhorn to v1.4.2, also bump to v1beta2 api
  - golangci-lint to v1.52.0

  Also, minor fixes for CI complaints

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - That would be helpful for debugging
    - Add more check for integration tests
    - Using longhorn v1.4.2

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
  - Wrap the cond lock for callers
  - Use a blocking channel to wait for the scanner to stop
  - Improve the log messages for uevents

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
  - now some functions are replaced by the standard library `os`

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    We need to check `DiskAddedToNode` against the provisioned device
    because the device update may fail. In that case the device status
    is not `Provisioned`, but the Longhorn node already has the disk,
    and we would otherwise never perform the follow-up update that
    marks the device `Provisioned`.

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
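
The check this commit describes can be sketched as below. This is an illustrative simplification: `diskAddedToNode` and `needProvisionUpdate` are hypothetical helpers; the real code inspects the Longhorn Node CR rather than a plain map.

```go
package main

import "fmt"

// diskAddedToNode is a stand-in for querying the Longhorn node spec for
// the disk; the real check looks at the Longhorn Node CR's disk list.
func diskAddedToNode(lhNodeDisks map[string]bool, name string) bool {
	return lhNodeDisks[name]
}

// needProvisionUpdate reports whether the device still has to be marked
// Provisioned. An earlier status update may have failed, leaving the
// Longhorn node with the disk while the blockdevice CR still says it is
// unprovisioned, so we must not rely on the CR status alone.
func needProvisionUpdate(status string, lhNodeDisks map[string]bool, name string) bool {
	if status == "Provisioned" {
		return false // nothing left to do
	}
	// CR says unprovisioned, but Longhorn already has the disk: the
	// previous update failed, so retry until the CR reaches Provisioned.
	return diskAddedToNode(lhNodeDisks, name)
}

func main() {
	disks := map[string]bool{"disk-1": true}
	fmt.Println(needProvisionUpdate("Unprovisioned", disks, "disk-1")) // true
	fmt.Println(needProvisionUpdate("Provisioned", disks, "disk-1"))   // false
}
```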
    - Add an XML sample to let users add a disk w/ WWN

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - change json format to yaml
    - skip G204 (gosec)
    - fix other lint errors

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
    - Add an upgrade test to CI. The detailed steps are as follows:

      1. Install the latest released version
      2. Add a disk
      3. Upgrade to the developing image
      4. Run the integration tests

Signed-off-by: Vicente Cheng <vicente.cheng@suse.com>
bk201 (Member) left a comment
LGTM
