-
Notifications
You must be signed in to change notification settings - Fork 44
DLPX-67251 Device removal fails due to inconsistent device names #162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bors delegate+ |
✌️ shartse can now approve this pull request. To approve and merge a pull request, simply reply with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level question regarding this
2. Wait until all links are created before importing (most likely, by calling udevadm settle)
delphix-migration.service
already depends on After=systemd-udev-settle.service
.
From my understanding, the issue is that when we import by-id
, we do so with:
zpool import -fN -d /dev/disk/by-id domain0
So there is no way to tell zfs to use scsi
links instead of wwn
links, is that right?
...n/var/lib/delphix-platform/ansible/10-delphix-platform/roles/delphix-platform/tasks/main.yml
Outdated
Show resolved
Hide resolved
@pzakha - yeah, that's right. You can provide a directory to Hm, if we're already waiting for udev settle maybe my understanding of why we're getting a mix of links isn't quite right. This fix should prevent it though. |
bors r+ |
162: DLPX-67251 Device removal fails due to inconsistent device names r=shartse a=shartse **Problem** On ESX, we use by-link links to create pools and to import them on migrated systems. There are two links per device, each with the same base number but with a different prefix (wwn vs scsi). When creating a pool for the first time, we explicitly use the wwn link, if it's available. However, it is not possible to specify this when importing a pool and since device links are created by udev asynchronously, sometimes a migrated pool can end up with a combination of wwn and scsi linked devices. This causes issues when we try to manage the devices with zpool commands passed through the delphix application. The DE thinks the correct name of the pool is the wwn version (since it's now present), but that doesn’t actually exist in the pool and the operation fails. **Possible Solutions** 1. Expand zpool import to be able to specify precise files (and prefixes) to always pick the wwn links 2. Wait until all links are created before importing (most likely, by calling `udevadm settle`) 3. Modify the udev rules so that only one by-id link per device is created. 4. Change DE so that it can identify devices within pools with greater flexibility. I decided against 1 and 4 since they'd require somewhat larger scope changes in ZFS and the app-stack respectively. 2 would add extra latency to migration, especially since `settle` waits for all different udev events, not just the ones pertinent to the devices we care about. I found that option 3 can be achieved by modifying the existing udev rules, so that's what I've gone with here. I've also opened an app-gate review here: http://reviews.delphix.com/r/54112 to codify the use of the scsi prefix ids by default. In the future, I'd like to move towards a solution where we write our own udev rules that covers devices across all platforms we support to get a single "delphix-id" and hopefully reduce the complexity here and in the app-stack. **Testing** I manually tested that a migration to a VM with this change was successful and saw that the pool was created using all scsi links. ``` domain0 22.5G 44.7M 22.5G - - 0% 0% 1.00x ONLINE - scsi-36000c299e8f8f8e643410a9a5aaf3595 7.50G 23.8M 7.48G - - 0% 0.30% - ONLINE scsi-36000c29ca33c1d0b3f1ee232d02652fd 7.50G 16.7M 7.48G - - 0% 0.21% - ONLINE scsi-36000c29359be70663180556fbd93d05a 7.50G 4.22M 7.50G - - 0% 0.05% - ONLINE ``` I also checked that configuring, adding and removing the devices all worked as expected on a migrated as well as a clean installed VM. I also tested the change on a GCP VM (the other platform we use by-id links on) and found no change in behavior. Automated tests: `git-ab-pre-push --test-upgrade-from 5.3.6.0 -p esx` http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/2500/ Co-authored-by: sara hartse <sara.hartse@delphix.com>
Build succeeded
|
Problem
On ESX, we use by-link links to create pools and to import them on migrated systems. There are two links per device, each with the same base number but with a different prefix (wwn vs scsi). When creating a pool for the first time, we explicitly use the wwn link, if it's available.
However, it is not possible to specify this when importing a pool and since device links are created by udev asynchronously, sometimes a migrated pool can end up with a combination of wwn and scsi linked devices.
This causes issues when we try to manage the devices with zpool commands passed through the delphix application. The DE thinks the correct name of the pool is the wwn version (since it's now present), but that doesn’t actually exist in the pool and the operation fails.
Possible Solutions
udevadm settle
)I decided against 1 and 4 since they'd require somewhat larger scope changes in ZFS and the app-stack respectively. 2 would add extra latency to migration, especially since
settle
waits for all different udev events, not just the ones pertinent to the devices we care about.I found that option 3 can be achieved by modifying the existing udev rules, so that's what I've gone with here. I've also opened an app-gate review here: http://reviews.delphix.com/r/54112 to codify the use of the scsi prefix ids by default.
In the future, I'd like to move towards a solution where we write our own udev rules that covers devices across all platforms we support to get a single "delphix-id" and hopefully reduce the complexity here and in the app-stack.
Testing
I manually tested that a migration to a VM with this change was successful and saw that the pool was created using all scsi links.
I also checked that configuring, adding and removing the devices all worked as expected on a migrated as well as a clean installed VM. I also tested the change on a GCP VM (the other platform we use by-id links on) and found no change in behavior.
Automated tests:
git-ab-pre-push --test-upgrade-from 5.3.6.0 -p esx
http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/2500/