MacStadium maintenance window on January 23rd #3616

Closed
21 of 26 tasks
UlisesGascon opened this issue Jan 22, 2024 · 21 comments
@UlisesGascon
Member

UlisesGascon commented Jan 22, 2024

As described in ticket: SERVICE-176962

Dear OpenJS

On Tuesday, January 23rd, 2024, at 9 AM ET, we need to conduct a one-hour maintenance in our ATL data center that will impact your ORKA cluster for one hour. We apologize in advance.

Before the start of the maintenance, please save and shut down any VMs in advance of the maintenance start.

We will notify you once the nodes are back up here in the ticket. Again, we apologize in advance for any inconvenience this may cause. Thank you for your understanding.
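For anyone coordinating from another time zone, the window can be converted to UTC. This is a minimal sketch, assuming that "ET" in the notice corresponds to the IANA zone `America/New_York` (the ticket does not name a zone) and that the window lasts exactly the one hour announced:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: "ET" in the notice maps to America/New_York.
EASTERN = ZoneInfo("America/New_York")


def maintenance_window_utc(start_local: datetime, duration: timedelta):
    """Return the (start, end) of a maintenance window converted to UTC."""
    start = start_local.astimezone(timezone.utc)
    return start, start + duration


# Start time and duration as stated in the ticket.
start, end = maintenance_window_utc(
    datetime(2024, 1, 23, 9, 0, tzinfo=EASTERN), timedelta(hours=1)
)
print(start.isoformat(), "to", end.isoformat())
```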

Potentially affected machines:

Next steps

I am not sure that I will be able to manage the "save and shut down" for the VMs before the deadline (tomorrow). Is anyone available to do it (@nodejs/build)?

test-orka-macos10.15-x64-1:

  • Restore test-orka-macos10.15-x64-1 in Orka cluster
  • Reansible test-orka-macos10.15-x64-1
  • re-enable test-orka-macos10.15-x64-1 in Jenkins
  • Save the state
  • commit the image changes

test-orka-macos10.15-x64-2:

  • Restore test-orka-macos10.15-x64-2 in Orka cluster
  • Reansible test-orka-macos10.15-x64-2
  • re-enable test-orka-macos10.15-x64-2 in Jenkins
  • Save the state
  • commit the image changes

test-orka-macos11-x64-1:

  • Restore test-orka-macos11-x64-1 in Orka cluster
  • Reansible test-orka-macos11-x64-1
  • re-enable test-orka-macos11-x64-1 in Jenkins
  • Save the state
  • commit the image changes

test-orka-macos11-x64-2:

  • Restore test-orka-macos11-x64-2 in Orka cluster
  • Reansible test-orka-macos11-x64-2
  • re-enable test-orka-macos11-x64-2 in Jenkins
  • Save the state
  • commit the image changes

release-orka-macos11-x64-1:

  • Restore release-orka-macos11-x64-1 in Orka cluster
  • Reansible release-orka-macos11-x64-1
  • Manual steps on release-orka-macos11-x64-1
  • re-enable release-orka-macos11-x64-1 in Jenkins
  • Save the state
  • commit the image changes
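
The per-machine checklists above follow one pattern, with an extra manual step for the release machine. As a rough sketch (the function names are illustrative, not part of any build tooling), the same lists can be generated from the machine names, which makes it easy to check that no machine or step was missed:

```python
MACHINES = [
    "test-orka-macos10.15-x64-1",
    "test-orka-macos10.15-x64-2",
    "test-orka-macos11-x64-1",
    "test-orka-macos11-x64-2",
    "release-orka-macos11-x64-1",
]


def recovery_steps(machine: str) -> list[str]:
    """Build the recovery checklist for one machine, as listed in this issue."""
    steps = [
        f"Restore {machine} in Orka cluster",
        f"Reansible {machine}",
    ]
    # Only the release machine has an additional manual step in the lists above.
    if machine.startswith("release-"):
        steps.append(f"Manual steps on {machine}")
    steps += [
        f"re-enable {machine} in Jenkins",
        "Save the state",
        "commit the image changes",
    ]
    return steps


def total_tasks(machines: list[str]) -> int:
    """Count every checklist item across all machines."""
    return sum(len(recovery_steps(m)) for m in machines)


for machine in MACHINES:
    print(f"{machine}:")
    for step in recovery_steps(machine):
        print(f"  - {step}")
```

The total comes to 26 items, matching the task count shown for this issue.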
@UlisesGascon
Member Author

Update (9 AM ET):

We are beginning the maintenance and will update you once completed.

@UlisesGascon
Member Author

Update (10 AM ET):

The maintenance is now completed. Thank you.

@UlisesGascon
Member Author

UlisesGascon commented Jan 23, 2024

We will need to recover the machines manually to get the Orka cluster working again. cc: @nodejs/build.

I am not available today, but I can potentially work on it tomorrow. Feel free to take the lead if you want.

IMPORTANT: You can use this table (#3240 (comment)) as a reference for where to place the VMs within the cluster, so that they line up with the inventory.

@UlisesGascon
Member Author

> I am not available today, but I can potentially work on it tomorrow. Feel free to take the lead if you want.

I am afraid I won't be able to work on it today; I can only start on it next Monday. 😓

@mhdawson
Member

@UlisesGascon thanks for working on it. One question is if the machine recovery is needed because they were not shut down properly (I only noticed the original issue too late to help out) or if that would have been required regardless?

@UlisesGascon
Member Author

> One question is if the machine recovery is needed because they were not shut down properly (I only noticed the original issue too late to help out) or if that would have been required regardless?

This situation is a bit tricky. Drawing from past experiences such as #3112, the VMs allocated in specific slots (including their port mappings) are expected to be shut down and effectively 'removed' from the Orka cluster nodes.

Once the cluster is back, a manual relocation process is necessary to create new VMs using the images. This ensures the correct slots are filled, maintaining the expected mapping from the inventory and Jenkins (IPs and ports).
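
The alignment check described here can be sketched as a comparison between the slots the inventory expects and the slots the VMs actually landed in. This is only an illustration: the `Slot` type, the node names, and the port numbers are hypothetical and do not come from the real inventory or the Orka API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    node: str      # Orka node the VM is expected to run on
    ssh_port: int  # port mapped to the VM's SSH service


def find_misplaced(expected: dict[str, Slot], deployed: dict[str, Slot]) -> dict[str, str]:
    """Report VMs that are missing or sit in a different slot than expected."""
    problems = {}
    for name, want in expected.items():
        got = deployed.get(name)
        if got is None:
            problems[name] = "not deployed"
        elif got != want:
            problems[name] = (
                f"expected {want.node}:{want.ssh_port}, "
                f"found {got.node}:{got.ssh_port}"
            )
    return problems


# Hypothetical values, for illustration only.
expected = {
    "test-orka-macos11-x64-1": Slot("node-a", 8822),
    "test-orka-macos11-x64-2": Slot("node-b", 8823),
}
deployed = {
    "test-orka-macos11-x64-1": Slot("node-a", 8822),
    "test-orka-macos11-x64-2": Slot("node-a", 8824),
}
misplaced = find_misplaced(expected, deployed)
print(misplaced)
```

Any VM reported here would need to be redeployed into its expected slot before Jenkins can reach it at the address recorded in the inventory.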

In this case, we didn't save and shut down the VMs before the maintenance. Consequently, I suspect that, because the VMs were destroyed, the images may hold an older state than the VMs had. This will require re-running Ansible on each VM once deployed, plus some manual configuration, particularly for the Jenkins tokens, depending on the state of the images.

I'll have a clearer picture on Monday. Unfortunately, I haven't been able to connect yet to check the status of the cluster or the nodes after the upgrade.

@mhdawson
Member

@UlisesGascon I don't think I'm up to speed enough to do the bring back, but if a second set of hands would be helpful when you have time to look at it and I'm around, I'm happy to get on a call and help if that makes sense.

@UlisesGascon
Member Author

I will start to work on it now

@UlisesGascon
Member Author

The 10.15 machines are back. I am working to re-ansible the macOS 11 VMs, but the process is taking time.

@UlisesGascon
Member Author

UlisesGascon commented Jan 29, 2024

I'm currently facing some challenges with the LLVM installation on macOS 11. The build process seems unusually time-consuming, taking hours (whereas I recall it used to take around 30 minutes). The process was so lengthy that the Ansible SSH connection timed out, so I changed strategy and executed this step manually (via SSH).

Screenshot 2024-01-29 at 18 20 42

I'm also puzzled about why the applied patch is ...arm64... since these are Intel machines. I've decided to let the process run overnight to see if the build generates any errors or if it finalizes properly.

@UlisesGascon
Member Author

So, the machines made some progress during the night. They are still installing dependencies (after I restored the SSH sessions that had timed out). I am not sure why it is so slow, but we are making progress.

@targos
Member

targos commented Jan 30, 2024

I think these long compile/install steps are due to Homebrew removing support for outdated macOS (it has to install deps from source instead of downloading prebuilt binaries).

@UlisesGascon
Member Author

> I think these long compile/install steps are due to Homebrew removing support for outdated macOS (it has to install deps from source instead of downloading prebuilt binaries).

This makes total sense. We need to commit the image changes after this process because the recovery process is very long.

@UlisesGascon
Member Author

UlisesGascon commented Jan 30, 2024

I am running into issues with the manual steps on release-orka-macos11-x64-1. sudo xcodebuild -license is hanging, and so is git. Not sure what the issue could be. 🤔

The ansible process worked fine, I will finish soon with the manual steps for release-orka-macos11-x64-1

@UlisesGascon
Member Author

So, release-orka-macos11-x64-1 seems to be working. I re-ran this canary build to check that the machine is working as expected. This will unblock the releases 🥳

I am still working on the macOS 11 test machines; the dependency build takes quite long.

@UlisesGascon
Member Author

UlisesGascon commented Jan 30, 2024

🥳 test-orka-macos11-x64-1 and test-orka-macos11-x64-2 are back!

I will commit the image changes once the queue is down to zero, to avoid creating further bottlenecks for the PRs.
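
Waiting for the queue to drain before taking the machines offline can be sketched as a simple polling loop. This is a minimal sketch rather than the procedure actually used here: `wait_for_empty_queue` and its parameters are hypothetical names, and the queue length is supplied by a callable so the example runs without network access.

```python
import time
from typing import Callable


def wait_for_empty_queue(
    get_queue_length: Callable[[], int],
    poll_seconds: float = 60.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the build queue is empty; give up after max_polls checks."""
    for attempt in range(max_polls):
        if get_queue_length() == 0:
            return True
        # Do not sleep after the final check.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return False


# Demo with canned queue lengths instead of a live Jenkins instance.
canned = iter([3, 1, 0])
drained = wait_for_empty_queue(lambda: next(canned), sleep=lambda s: None)
print("queue drained:", drained)
```

In practice `get_queue_length` could be backed by the Jenkins queue JSON endpoint (`/queue/api/json`, counting its `items`), but that wiring is left out as an assumption about the setup.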

Here are the first jobs from the queue, I will check they are passing before doing the commit of the images:

Update: the CI jobs were fine as far as I can see.

UlisesGascon self-assigned this Jan 30, 2024
@UlisesGascon
Member Author

I will start the image commit now, so I will temporarily disconnect the machines from Jenkins while it runs.

@UlisesGascon
Member Author

I got an error while connecting to the VPN. I created a support ticket SERVICE-178721.

@UlisesGascon
Member Author

The login error was resolved, but I had to open a separate support ticket (SERVICE-178790) because I am getting errors while saving the changes.

@UlisesGascon
Member Author

Current status

I've cleaned up all the macOS machines in ORKA as they were starting to generate space issues again. Additionally, I've created a state save for each machine.

Initially, I thought I needed to push this state to the VM images. However, some of them are using a common image, so it might not be a good idea. I've asked support what the best strategy is to maintain the VMs in this state regardless of any changes. I am waiting for a final response before closing this issue.
Screenshot 2024-02-27 at 15 15 07
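
The concern about common images can be illustrated by grouping VMs by their base image: committing one VM's state to an image that other VMs also use would change what those VMs are deployed from. The image names below are hypothetical placeholders, not the real ones in the cluster.

```python
from collections import defaultdict


def shared_images(vm_to_image: dict[str, str]) -> dict[str, list[str]]:
    """Return only the images that back more than one VM."""
    by_image = defaultdict(list)
    for vm, image in vm_to_image.items():
        by_image[image].append(vm)
    return {img: sorted(vms) for img, vms in by_image.items() if len(vms) > 1}


# Hypothetical image names, for illustration only.
vm_to_image = {
    "test-orka-macos11-x64-1": "macos11-base.img",
    "test-orka-macos11-x64-2": "macos11-base.img",
    "release-orka-macos11-x64-1": "macos11-release.img",
}
shared = shared_images(vm_to_image)
print(shared)
```

Any image that shows up in the result is one where a commit from a single VM would affect its siblings, which is why a per-VM state save may be the safer option.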

@UlisesGascon
Member Author

I think it is fine for now, so I am closing this issue to unblock #3642
