Merge master into 6.0/stage #318
Merged
143: Disable "zfs-volume-wait" service inside container r=prakashsurya a=prakashsurya Co-authored-by: Prakash Surya <prakash.surya@delphix.com>
The root-cause of this bug is that the ssh systemd service doesn't have a dependency on network interface configuration. By default, when using DHCP, the sshd daemon listens on the unspecified address (0.0.0.0). When the system is configured with static IP addresses, however, each address gets included individually as "ListenAddress" directives in /etc/ssh/sshd_config. This results in sshd binding to and listening on each address individually. If, at startup, the addresses listed there are not configured, sshd will fail to bind to them, and will not listen for connections to those addresses. When that happens, we can see sshd output errors in the ssh service journal:

    -- Reboot --
    Oct 01 22:39:37 localhost sshd[604]: Server listening on 127.0.0.1 port 22.
    Oct 01 22:39:37 localhost sshd[604]: error: Bind to port 22 on 10.43.42.64 faile

The fix is to have the ssh service depend on the network.target systemd unit.
145: Remove glances package r=jgallag88 a=jgallag88 This package provides a neat terminal UI for monitoring various aspects of a system, but it pulls in about 80MB of dependencies and the information it provides can be obtained through other means. Co-authored-by: John Gallagher <john.gallagher@delphix.com>
146: DLPX-66267 SSH service stops listening to external sources after reboot r=sebroy a=sebroy

The root-cause of this bug is that the ssh systemd service doesn't have a dependency on network interface configuration. By default, when using DHCP, the sshd daemon listens on the unspecified address (0.0.0.0). When the system is configured with static IP addresses, however, each address gets included individually as "ListenAddress" directives in /etc/ssh/sshd_config. This results in sshd binding to and listening on each address individually. If, at startup, the addresses listed there are not configured, sshd will fail to bind to them, and will not listen for connections to those addresses. When that happens, we can see sshd output errors in the ssh service journal:

```
-- Reboot --
Oct 01 22:39:37 localhost sshd[604]: Server listening on 127.0.0.1 port 22.
Oct 01 22:39:37 localhost sshd[604]: error: Bind to port 22 on 10.43.42.64 faile
```

The fix is to have the ssh service depend on the network.target systemd unit.

With this fix, I've confirmed that sshd successfully binds to all static addresses. The critical path for the ssh startup at boot now looks like this:

```
delphix@localhost:~$ sudo systemd-analyze critical-chain ssh.service
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.

ssh.service +51ms
└─network.target @5.309s
  └─systemd-resolved.service @4.184s +1.123s
    └─systemd-networkd.service @2.048s +2.134s
      └─network-pre.target @2.047s
        └─cloud-init-local.service @956ms +1.090s
          └─open-vm-tools.service @948ms
            └─vgauth.service @946ms
              └─systemd-tmpfiles-setup.service @918ms +23ms
                └─systemd-journal-flush.service @689ms +226ms
                  └─var-log.mount @589ms +97ms
                    └─local-fs-pre.target @580ms
                      └─systemd-tmpfiles-setup-dev.service @548ms +24ms
                        └─kmod-static-nodes.service @416ms +86ms
                          └─systemd-journald.socket @415ms
                            └─system.slice @399ms
                              └─-.slice @322ms
```

Co-authored-by: Sebastien Roy <seb@delphix.com>
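The failure mode described in this commit can be illustrated with a small diagnostic sketch. It is not part of the fix: the helper names `listen_addresses` and `unbindable` are hypothetical, and the sketch assumes plain `ListenAddress <address>` lines without a port suffix. Given the text of an sshd_config and the addresses currently configured on the system, it reports which listed addresses sshd would not be able to bind to.

```python
def listen_addresses(sshd_config_text):
    """Return the addresses named by ListenAddress directives, in file order."""
    addresses = []
    for raw_line in sshd_config_text.splitlines():
        line = raw_line.strip()
        # Empty lines and lines starting with '#' are comments in sshd_config.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        # sshd_config keywords are case-insensitive.
        if len(parts) == 2 and parts[0].lower() == "listenaddress":
            addresses.append(parts[1].strip())
    return addresses


def unbindable(sshd_config_text, configured_addresses):
    """Return ListenAddress entries that are not configured on any interface."""
    configured = set(configured_addresses)
    return [a for a in listen_addresses(sshd_config_text) if a not in configured]


if __name__ == "__main__":
    sample = "ListenAddress 127.0.0.1\nListenAddress 10.43.42.64\n"
    # Early in boot only the loopback address is assumed to be configured.
    print(unbindable(sample, ["127.0.0.1"]))
```

Run against the situation in the journal excerpt (loopback up, static address not yet configured), the sketch flags 10.43.42.64, which matches the address in the bind error.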
148: Add debugging symbols for important packages r=prakashsurya a=jgallag88 Adds debug symbols for dependencies that are built in-house or are particularly important. Co-authored-by: John Gallagher <john.gallagher@delphix.com>
149: DLPX-66624 iscsi and nfs drivers missing from kvm qcow2 image r=prakashsurya a=prakashsurya Co-authored-by: George Wilson <george.wilson@delphix.com>
154: DLPX-66534 arc_prune consumes all cpus r=grwilson a=grwilson Co-authored-by: George Wilson <george.wilson@delphix.com>
155: DLPX-66227 Disk I/O scheduler should be `noop` rather than default `cfq` r=tonynguien a=tonynguien Disk should have `noop` I/O scheduler for optimal performance since ZFS will schedule I/Os. Co-authored-by: Tony Nguyen <tony.nguyen@delphix.com>
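One way to check which scheduler is in effect, sketched here under the assumption of the standard Linux sysfs layout: `/sys/block/<dev>/queue/scheduler` lists the available schedulers and marks the active one in square brackets. The helper name `active_scheduler` and the device name `sda` are hypothetical and not taken from this change.

```python
def active_scheduler(scheduler_line):
    """Return the bracketed (active) scheduler from a sysfs scheduler line."""
    for token in scheduler_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_active_scheduler(device):
    """Read the active scheduler for a block device such as 'sda'."""
    with open("/sys/block/%s/queue/scheduler" % device) as f:
        return active_scheduler(f.read())


if __name__ == "__main__":
    # Example sysfs contents before and after switching to noop.
    print(active_scheduler("noop deadline [cfq]"))
    print(active_scheduler("[noop] deadline cfq"))
```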
153: DLPX-65491 Invalid argument when mounting ZFS filesystem r=grwilson a=grwilson Co-authored-by: George Wilson <george.wilson@delphix.com>
…plan files

The root-cause of this issue is that the service that generates the default netplan file for cloud-init (named cloud-init-local) can run at the same time as delphix-migration, which writes our own custom netplan file on-disk (and potentially deletes the default one if the timing is right). Unfortunately, timing is not always right, and due to the above race between the two services we end up with two netplan files that can have conflicting info. This change ensures that the migration service runs after cloud-init-local so the default netplan file is always generated before our custom one takes its place.

Note again that this is a migration-only issue that can happen on first boot. We disable the cloud-init-local service from regenerating its netplan file for subsequent boots.
159: DLPX-67281 Network configuration not migrated because of multiple netplan files r=pzakha a=sdimitro

# Commit Description

The root-cause of this issue is that the service that generates the default netplan file for cloud-init (named cloud-init-local) can run at the same time as delphix-migration, which writes our own custom netplan file on-disk (and potentially deletes the default one if the timing is right). Unfortunately, timing is not always right, and due to the above race between the two services we end up with two netplan files that can have conflicting info. This change ensures that the migration service runs after cloud-init-local so the default netplan file is always generated before our custom one takes its place.

Note again that this is a migration-only issue that can happen on first boot. We disable the cloud-init-local service from regenerating its netplan file for subsequent boots.

# Testing

(pending test results)

Co-authored-by: Serapheim Dimitropoulos <serapheim@delphix.com>
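The symptom of the race (more than one netplan file left on disk) can be detected with a short check like the one below. This is only an illustrative sketch, not the fix itself: the helper names are hypothetical, `/etc/netplan` is assumed as the configuration directory, and `10-delphix.yaml` is a made-up stand-in for the custom file, whose real name is not given here.

```python
import os


def netplan_yaml_files(names):
    """Return the netplan YAML file names among the given directory entries."""
    return sorted(n for n in names if n.endswith(".yaml"))


def conflicting_netplan_files(names):
    """Return all netplan files if more than one is present, else an empty list."""
    files = netplan_yaml_files(names)
    return files if len(files) > 1 else []


def scan(directory="/etc/netplan"):
    """Apply the check to a real directory (assumed default location)."""
    return conflicting_netplan_files(os.listdir(directory))


if __name__ == "__main__":
    # Hypothetical outcome of the race: both files survive the migration.
    print(conflicting_netplan_files(["50-cloud-init.yaml", "10-delphix.yaml"]))
```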
162: DLPX-67251 Device removal fails due to inconsistent device names r=shartse a=shartse

**Problem**

On ESX, we use by-id links to create pools and to import them on migrated systems. There are two links per device, each with the same base number but with a different prefix (wwn vs scsi). When creating a pool for the first time, we explicitly use the wwn link, if it's available. However, it is not possible to specify this when importing a pool, and since device links are created by udev asynchronously, sometimes a migrated pool can end up with a combination of wwn and scsi linked devices. This causes issues when we try to manage the devices with zpool commands passed through the delphix application. The DE thinks the correct name of the device is the wwn version (since it's now present), but that doesn't actually exist in the pool and the operation fails.

**Possible Solutions**

1. Expand zpool import to be able to specify precise files (and prefixes) to always pick the wwn links.
2. Wait until all links are created before importing (most likely, by calling `udevadm settle`).
3. Modify the udev rules so that only one by-id link per device is created.
4. Change DE so that it can identify devices within pools with greater flexibility.

I decided against 1 and 4 since they'd require somewhat larger scope changes in ZFS and the app-stack respectively. 2 would add extra latency to migration, especially since `settle` waits for all different udev events, not just the ones pertinent to the devices we care about. I found that option 3 can be achieved by modifying the existing udev rules, so that's what I've gone with here. I've also opened an app-gate review here: http://reviews.delphix.com/r/54112 to codify the use of the scsi prefix ids by default.

In the future, I'd like to move towards a solution where we write our own udev rules that cover devices across all platforms we support to get a single "delphix-id" and hopefully reduce the complexity here and in the app-stack.

**Testing**

I manually tested that a migration to a VM with this change was successful and saw that the pool was created using all scsi links.

```
domain0                                   22.5G  44.7M  22.5G  -  -  0%  0%     1.00x  ONLINE  -
  scsi-36000c299e8f8f8e643410a9a5aaf3595  7.50G  23.8M  7.48G  -  -  0%  0.30%  -      ONLINE
  scsi-36000c29ca33c1d0b3f1ee232d02652fd  7.50G  16.7M  7.48G  -  -  0%  0.21%  -      ONLINE
  scsi-36000c29359be70663180556fbd93d05a  7.50G  4.22M  7.50G  -  -  0%  0.05%  -      ONLINE
```

I also checked that configuring, adding and removing the devices all worked as expected on a migrated as well as a clean installed VM. I also tested the change on a GCP VM (the other platform we use by-id links on) and found no change in behavior.

Automated tests: `git-ab-pre-push --test-upgrade-from 5.3.6.0 -p esx`
http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/2500/

Co-authored-by: sara hartse <sara.hartse@delphix.com>
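The mixed-prefix condition described under **Problem** can be sketched as a small check over device link names. This is an illustration only, not the udev rule change made here. It assumes the two links for a device are named `wwn-0x<id>` and `scsi-3<id>` (the `scsi-3` form matches the pool listing above; the `wwn-0x` form is an assumption based on common udev naming), and the helper names are hypothetical.

```python
# Assumed link-name prefixes; only the "scsi-3" form appears in the pool listing.
PREFIXES = ("wwn-0x", "scsi-3")


def base_id(link_name):
    """Strip a known prefix and return the shared base identifier, or None."""
    for prefix in PREFIXES:
        if link_name.startswith(prefix):
            return link_name[len(prefix):]
    return None


def same_device(link_a, link_b):
    """Return True if two link names refer to the same base identifier."""
    a, b = base_id(link_a), base_id(link_b)
    return a is not None and a == b


def has_mixed_prefixes(pool_device_names):
    """Return True if the pool's devices use more than one link prefix."""
    used = {p for name in pool_device_names for p in PREFIXES if name.startswith(p)}
    return len(used) > 1


if __name__ == "__main__":
    pool = [
        "scsi-36000c299e8f8f8e643410a9a5aaf3595",
        "wwn-0x6000c29ca33c1d0b3f1ee232d02652fd",  # hypothetical wwn-linked device
    ]
    print(has_mixed_prefixes(pool))
```

With only one by-id link per device (option 3), a pool can no longer end up in the mixed state this check looks for.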
164: Add performance-diagnostics package r=jgallag88 a=jgallag88 Co-authored-by: John Gallagher <john.gallagher@delphix.com>
161: DLPX-67394 Increase postgres service timeout during migration (Part 2 of 2) r=pzakha a=pzakha

Part 1 of 2: http://reviews.delphix.com/r/54101

See JIRA for description.

Caveat: For the internal-dev variant we deploy other override.conf files under `/etc/systemd/...`, so the files for the same service under `/run/systemd/...` are ignored by systemd. An alternative would be to edit `/etc/systemd` files instead, but the logic would be more complex and prone to failure, so I've opted for this instead.

## Testing

Testing this change only:
- migration: http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/2490/

Testing both parts together: see http://reviews.delphix.com/r/54101

Co-authored-by: Pavel Zakharov <pavel.zakharov@delphix.com>
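The caveat follows from how systemd resolves drop-in files: a drop-in under `/etc` takes precedence over one with the same file name under `/run`, while drop-ins with different names are all applied in lexicographic order. The sketch below models that rule only; the function name, the unit name `postgresql.service`, and the file name `timeout.conf` are hypothetical and not taken from this change.

```python
import posixpath

# Drop-in search roots, highest priority first.
DROPIN_ROOTS = (
    "/etc/systemd/system",
    "/run/systemd/system",
    "/usr/lib/systemd/system",
)


def effective_dropins(unit, files_by_root):
    """Return the drop-in paths systemd would apply for a unit.

    files_by_root maps a root directory to the drop-in file names found in
    <root>/<unit>.d/. For each file name the highest-priority root wins.
    """
    chosen = {}
    for root in DROPIN_ROOTS:
        for name in files_by_root.get(root, ()):
            if name.endswith(".conf"):
                # setdefault keeps the first (highest-priority) path seen.
                chosen.setdefault(name, posixpath.join(root, unit + ".d", name))
    return [chosen[name] for name in sorted(chosen)]


if __name__ == "__main__":
    # internal-dev variant: /etc already ships an override.conf, so the
    # same-named file under /run is masked.
    print(effective_dropins("postgresql.service", {
        "/etc/systemd/system": ["override.conf"],
        "/run/systemd/system": ["override.conf"],
    }))
```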
…n boot on 5.0 kernel
168: DLPX-67583 migration: floppy driver sometimes causes system to hang on boot on 5.0 kernel r=pzakha a=pzakha

See JIRA for details.

## Testing

- manually tested fix on affected system
- migration pre-push on esx: http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/2544/

Co-authored-by: Pavel Zakharov <pavel.zakharov@delphix.com>
167: DLPX-67545 start rate limit for systemd-networkd should be in unit section r=grwilson a=grwilson Co-authored-by: George Wilson <george.wilson@delphix.com>
…ix#170) This fix disables the Ubuntu-provided motd banner and replaces it with a simple Delphix-specific banner. It also disables the motd-news systemd service that dynamically fetches news from a public Internet service.
…e up a large percentage of syslog (delphix#266) DLPX-72681 delphix-startup-screen fails with static IP address DLPX-73286 systemd is restarting locale service every 30 seconds DLPX-73423 delphix-startup-screen crashes if there's no default route
…table permissions (delphix#276)
…tform service failed (delphix#275)
…ting "time stamp ... in the future" (delphix#307)
This backports the following changes:
- Move open-iscsi override out of /etc (delphix#263)
- changes for docker dep (delphix#311) (which is part of DLPX-76534)
- Replace TravisCI with Github Actions (delphix#212)
- Use "delphix/actions" for shellcheck and shfmt (delphix#220)

Note that this also adds the following file from master:
- .github/workflows/main.yml
This brings 6.0/stage in sync with master.
prakashsurya
approved these changes
Aug 3, 2021
sdimitro
approved these changes
Aug 4, 2021