Merge master into 6.0/stage #318

Merged
merged 98 commits into from
Aug 4, 2021

Conversation

pzakha
Contributor

@pzakha pzakha commented Aug 2, 2021

This backports the following changes:

Note that this also adds the following file from master:

  • .github/workflows/main.yml

Testing

Prakash Surya and others added 30 commits September 27, 2019 13:33
143: Disable "zfs-volume-wait" service inside container r=prakashsurya a=prakashsurya



Co-authored-by: Prakash Surya <prakash.surya@delphix.com>
145: Remove glances package r=jgallag88 a=jgallag88

This package provides a neat terminal UI for monitoring various aspects of a system, but it pulls in about 80MB of dependencies and the information it provides can be obtained through other means.

Co-authored-by: John Gallagher <john.gallagher@delphix.com>
146: DLPX-66267 SSH service stops listening to external sources after reboot r=sebroy a=sebroy

The root cause of this bug is that the ssh systemd service has no
dependency on network interface configuration.

By default, when using DHCP, the sshd daemon listens on the unspecified
address (0.0.0.0). When the system is configured with static IP
addresses, however, each address gets included individually as
"ListenAddress" directives in /etc/ssh/sshd_config. This results in sshd
binding to and listening on each address individually. If, at startup,
the addresses listed there are not configured, sshd will fail to bind to
them, and will not listen for connections to those addresses. When that
happens, we can see sshd output errors in the ssh service journal:
```
 -- Reboot --
Oct 01 22:39:37 localhost sshd[604]: Server listening on 127.0.0.1 port 22.
Oct 01 22:39:37 localhost sshd[604]: error: Bind to port 22 on 10.43.42.64 faile
```
The fix is to have the ssh service depend on the network.target systemd unit.

With this fix, I've confirmed that sshd successfully binds to all static addresses. The critical path for the ssh startup at boot now looks like this:
```
delphix@localhost:~$ sudo systemd-analyze critical-chain ssh.service
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.

ssh.service +51ms
└─network.target @5.309s
  └─systemd-resolved.service @4.184s +1.123s
    └─systemd-networkd.service @2.048s +2.134s
      └─network-pre.target @2.047s
        └─cloud-init-local.service @956ms +1.090s
          └─open-vm-tools.service @948ms
            └─vgauth.service @946ms
              └─systemd-tmpfiles-setup.service @918ms +23ms
                └─systemd-journal-flush.service @689ms +226ms
                  └─var-log.mount @589ms +97ms
                    └─local-fs-pre.target @580ms
                      └─systemd-tmpfiles-setup-dev.service @548ms +24ms
                        └─kmod-static-nodes.service @416ms +86ms
                          └─systemd-journald.socket @415ms
                            └─system.slice @399ms
                              └─-.slice @322ms
```
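
The failure mode described above can be illustrated with a minimal Python sketch. It is not part of the fix; it only models which `ListenAddress` entries sshd would fail to bind if it started before the static addresses were configured. The function names are hypothetical, and the sketch assumes each directive holds a bare IP address (no `host:port` form).

```python
import ipaddress


def listen_addresses(sshd_config_text):
    """Return the addresses named by ListenAddress directives."""
    addresses = []
    for raw in sshd_config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # sshd_config keywords are case-insensitive.
        if len(parts) >= 2 and parts[0].lower() == "listenaddress":
            addresses.append(parts[1])
    return addresses


def unbindable(sshd_config_text, configured):
    """Return listed addresses that no interface currently holds."""
    held = {ipaddress.ip_address(a) for a in configured}
    missing = []
    for addr in listen_addresses(sshd_config_text):
        ip = ipaddress.ip_address(addr)
        # The unspecified address (0.0.0.0 or ::) can always be bound.
        if ip.is_unspecified or ip in held:
            continue
        missing.append(addr)
    return missing


config = "ListenAddress 127.0.0.1\nListenAddress 10.43.42.64\n"
# Early in boot only loopback is up, so the static address cannot be bound.
print(unbindable(config, ["127.0.0.1"]))
```

Once the network is configured, the same call with `["127.0.0.1", "10.43.42.64"]` returns an empty list, which matches the behavior observed after the fix.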

Co-authored-by: Sebastien Roy <seb@delphix.com>
148: Add debugging symbols for important packages r=prakashsurya a=jgallag88

Adds debug symbols for dependencies that are built in-house or are particularly important.

Co-authored-by: John Gallagher <john.gallagher@delphix.com>
149: DLPX-66624 iscsi and nfs drivers missing from kvm qcow2 image r=prakashsurya a=prakashsurya



Co-authored-by: George Wilson <george.wilson@delphix.com>
154: DLPX-66534 arc_prune consumes all cpus r=grwilson a=grwilson



Co-authored-by: George Wilson <george.wilson@delphix.com>
155: DLPX-66227 Disk I/O scheduler should be `noop` rather than default `cfq` r=tonynguien a=tonynguien

Disks should use the `noop` I/O scheduler for optimal performance, since ZFS schedules I/Os itself.
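
As a rough illustration of how the active scheduler can be checked, the Python sketch below parses the contents of `/sys/block/<dev>/queue/scheduler`, where the kernel marks the active scheduler with square brackets. The function names and the sample string are assumptions made for the example; this is not the mechanism used by the change itself.

```python
import re


def active_scheduler(sysfs_text):
    """Return the scheduler shown in brackets, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def needs_change(sysfs_text, wanted="noop"):
    """True when the active scheduler differs from the wanted one."""
    return active_scheduler(sysfs_text) != wanted


# Sample content; on a real system this would be read from sysfs.
print(active_scheduler("noop deadline [cfq]"))
print(needs_change("[noop] deadline cfq"))
```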

Co-authored-by: Tony Nguyen <tony.nguyen@delphix.com>
153: DLPX-65491 Invalid argument when mounting ZFS filesystem r=grwilson a=grwilson



Co-authored-by: George Wilson <george.wilson@delphix.com>
159: DLPX-67281 Network configuration not migrated because of multiple netplan files r=pzakha a=sdimitro

# Commit Description

The root cause of this issue is that the service that generates the
default netplan file for cloud-init (named cloud-init-local) can run
at the same time as delphix-migration, which writes our own custom
netplan file to disk (and potentially deletes the default one if the
timing is right). Unfortunately, the timing is not always right, and
due to this race between the two services we can end up with two
netplan files that contain conflicting information.

This change ensures that the migration service runs after cloud-init-local
so the default netplan file is always generated before our custom one
takes its place.

Note again that this is a migration-only issue that can happen on
first boot. We prevent the cloud-init-local service from regenerating
its netplan file on subsequent boots.
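
To show why an explicit ordering removes the race, here is a small Python model of unit start ordering. It assumes the constraint behaves like a systemd `After=` dependency; the dictionary layout and the function name are hypothetical, and the unit names are taken from the description above without their unit-file suffixes.

```python
from graphlib import TopologicalSorter


def start_order(after):
    """Return one valid start order.

    `after` maps each unit to the set of units it must start after.
    """
    return list(TopologicalSorter(after).static_order())


units = {
    "cloud-init-local": set(),
    # With this edge, the migration can no longer overlap with cloud-init-local.
    "delphix-migration": {"cloud-init-local"},
}
print(start_order(units))
```

Without the edge, both units have no predecessors and either may start first, which is the situation in which two netplan files could be left on disk.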

# Testing
 
(pending test results)

Co-authored-by: Serapheim Dimitropoulos <serapheim@delphix.com>
162: DLPX-67251 Device removal fails due to inconsistent device names r=shartse a=shartse

**Problem**
On ESX, we use by-id links to create pools and to import them on migrated systems. There are two links per device, each with the same base number but with a different prefix (wwn vs scsi). When creating a pool for the first time, we explicitly use the wwn link, if it's available.

However, it is not possible to specify this when importing a pool and since device links are created by udev asynchronously, sometimes a migrated pool can end up with a combination of wwn and scsi linked devices.

This causes issues when we try to manage the devices with zpool commands passed through the Delphix application. The DE thinks the correct name of the device is the wwn version (since it's now present), but that name doesn't actually exist in the pool, so the operation fails.

**Possible Solutions** 
1. Expand zpool import to be able to specify precise files (and prefixes) to always pick the wwn links
2. Wait until all links are created before importing (most likely, by calling `udevadm settle`)
3. Modify the udev rules so that only one by-id link per device is created. 
4. Change DE so that it can identify devices within pools with greater flexibility. 

I decided against 1 and 4 since they'd require somewhat larger scope changes in ZFS and the app-stack respectively. Option 2 would add extra latency to migration, especially since `settle` waits for all pending udev events, not just the ones pertinent to the devices we care about.

I found that option 3 can be achieved by modifying the existing udev rules, so that's what I've gone with here. I've also opened an app-gate review here: http://reviews.delphix.com/r/54112 to codify the use of the scsi prefix ids by default.  

In the future, I'd like to move towards a solution where we write our own udev rules that cover devices across all platforms we support to get a single "delphix-id" and hopefully reduce the complexity here and in the app-stack.

**Testing**
I manually tested that a migration to a VM with this change was successful and saw that the pool was created using all scsi links. 
```
domain0                                   22.5G  44.7M  22.5G        -         -     0%     0%  1.00x    ONLINE  -
  scsi-36000c299e8f8f8e643410a9a5aaf3595  7.50G  23.8M  7.48G        -         -     0%  0.30%      -  ONLINE  
  scsi-36000c29ca33c1d0b3f1ee232d02652fd  7.50G  16.7M  7.48G        -         -     0%  0.21%      -  ONLINE  
  scsi-36000c29359be70663180556fbd93d05a  7.50G  4.22M  7.50G        -         -     0%  0.05%      -  ONLINE  
```
I also checked that configuring, adding and removing the devices all worked as expected on both a migrated VM and a cleanly installed VM. I also tested the change on a GCP VM (the other platform we use by-id links on) and found no change in behavior.
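
The manual check that the pool uses only scsi links could be scripted roughly as follows. This Python sketch is an illustration, not part of the change: the function name is hypothetical, it assumes the indented layout shown in the output above, and it only recognizes the two prefixes discussed here.

```python
KNOWN_PREFIXES = {"scsi", "wwn"}


def vdev_link_prefixes(zpool_list_output):
    """Return the set of by-id prefixes used by the pool's devices."""
    prefixes = set()
    for line in zpool_list_output.splitlines():
        # Devices are indented beneath the pool name; skip everything else.
        if not line.startswith(" "):
            continue
        fields = line.split()
        if not fields:
            continue
        prefix, sep, _ = fields[0].partition("-")
        if sep and prefix in KNOWN_PREFIXES:
            prefixes.add(prefix)
    return prefixes


sample = "\n".join([
    "domain0                                   22.5G  44.7M  22.5G",
    "  scsi-36000c299e8f8f8e643410a9a5aaf3595  7.50G  23.8M  7.48G",
    "  scsi-36000c29ca33c1d0b3f1ee232d02652fd  7.50G  16.7M  7.48G",
    "  scsi-36000c29359be70663180556fbd93d05a  7.50G  4.22M  7.50G",
])
print(vdev_link_prefixes(sample))
```

A result containing both `scsi` and `wwn` would indicate the mixed state described in the problem statement.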

Automated tests: `git-ab-pre-push --test-upgrade-from 5.3.6.0 -p esx` http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/2500/

Co-authored-by: sara hartse <sara.hartse@delphix.com>
164: Add performance-diagnostics package r=jgallag88 a=jgallag88



Co-authored-by: John Gallagher <john.gallagher@delphix.com>
161: DLPX-67394 Increase postgres service timeout during migration (Part 2 of 2) r=pzakha a=pzakha

Part 1 of 2: http://reviews.delphix.com/r/54101

See JIRA for description.

Caveat:
For the internal-dev variant we deploy other override.conf files under `/etc/systemd/...`, so the files for the same service under `/run/systemd/...` are ignored by systemd. An alternative would be to edit `/etc/systemd` files instead, but the logic would be more complex and prone to failure, so I've opted for this instead.
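
The caveat follows from how systemd resolves drop-in files: a file under `/etc` takes precedence over a file with the same name under `/run`, which in turn takes precedence over `/usr/lib`, while differently named drop-ins are all applied in lexicographic order. The Python sketch below models that rule. The function name, the input layout, and the unit name `postgresql.service` are assumptions for illustration, as is the premise that both locations use the file name `override.conf`.

```python
# Highest-priority directory first.
SEARCH_PATH = [
    "/etc/systemd/system",
    "/run/systemd/system",
    "/usr/lib/systemd/system",
]


def effective_dropins(unit, present):
    """Return the drop-in paths that take effect for `unit`.

    `present` maps a directory from SEARCH_PATH to the drop-in file
    names found in its `<unit>.d/` subdirectory.
    """
    chosen = {}
    # Walk from lowest to highest priority so later entries override.
    for directory in reversed(SEARCH_PATH):
        for name in present.get(directory, []):
            chosen[name] = f"{directory}/{unit}.d/{name}"
    return [chosen[name] for name in sorted(chosen)]


present = {
    "/etc/systemd/system": ["override.conf"],
    "/run/systemd/system": ["override.conf"],
}
print(effective_dropins("postgresql.service", present))
```

Under this model, a drop-in in `/run` with a different file name would still be applied alongside the `/etc` one.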

## Testing
Testing this change only:
- migration: http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/2490/

Testing both parts together: see http://reviews.delphix.com/r/54101


Co-authored-by: Pavel Zakharov <pavel.zakharov@delphix.com>
168: DLPX-67583 migration: floppy driver sometimes causes system to hang on boot on 5.0 kernel r=pzakha a=pzakha

See JIRA for details.
## Testing
- manually tested fix on affected system
- migration pre-push on esx: http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/appliance-build-orchestrator-pre-push/2544/

Co-authored-by: Pavel Zakharov <pavel.zakharov@delphix.com>
167: DLPX-67545 start rate limit for systemd-networkd should be in unit section r=grwilson a=grwilson



Co-authored-by: George Wilson <george.wilson@delphix.com>
…ix#170)


This fix disables the Ubuntu-provided motd banner and replaces it with a simple Delphix-specific banner. It also disables the motd-news systemd service that dynamically fetches news from a public Internet service.
Don Brady and others added 21 commits January 27, 2021 15:46
…e up a large percentage of syslog (delphix#266)

DLPX-72681 delphix-startup-screen fails with static IP address
DLPX-73286 systemd is restarting locale service every 30 seconds
DLPX-73423 delphix-startup-screen crashes if there's no default route
This backports the following changes:
- Move open-iscsi override out of /etc (delphix#263)
- changes for docker dep (delphix#311) (which is part of DLPX-76534)
- Replace TravisCI with Github Actions (delphix#212)
- Use "delphix/actions" for shellcheck and shfmt (delphix#220)

Note that this also adds the following file from master:
- .github/workflows/main.yml
@pzakha
Contributor Author

pzakha commented Aug 2, 2021

This brings 6.0/stage in sync with master.

@pzakha pzakha changed the title from "Merge" to "Merge master into 6.0/stage" Aug 3, 2021
@pzakha pzakha merged commit d03910d into delphix:6.0/stage Aug 4, 2021