
Runner detect ACPI shutdown #1068

Merged
merged 36 commits into master from runner/detect-terminate
Jul 4, 2022

Conversation

Contributor

@DavidGOrtega DavidGOrtega commented Jun 19, 2022

In GCP and AWS, spot instances are terminated via an ACPI shutdown. This PR also fixes resource termination when the runner is stopped from the cloud console.
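
For context, the detection mechanism under discussion is listening to the acpid event socket. A minimal sketch of the idea in Node.js (not the PR's actual code; the socket path and event names are acpid's defaults):

```js
// Minimal sketch: connect to the acpid UNIX socket and react to a
// power-button/shutdown event. Assumes acpid is running and exposes
// its default socket at /var/run/acpid.socket.
const net = require('net');

const ACPI_SOCK = '/var/run/acpid.socket';
const socket = net.createConnection(ACPI_SOCK);

socket.on('data', (data) => {
  // acpid broadcasts lines like "button/power PBTN 00000080 00000001"
  for (const event of data.toString().trim().split('\n')) {
    if (event.startsWith('button/power')) {
      console.log('ACPI shutdown detected, unregistering runner...');
      // cleanup/unregister routines would run here before exiting
      process.exit(0);
    }
  }
});

socket.on('error', (err) => {
  // e.g. the socket does not exist on this image
  console.warn(`Could not listen to acpid: ${err.message}`);
});
```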

@DavidGOrtega DavidGOrtega added the blocked Dependent on something else label Jun 19, 2022
@DavidGOrtega DavidGOrtega marked this pull request as draft June 19, 2022 22:03
@DavidGOrtega DavidGOrtega self-assigned this Jun 19, 2022
@DavidGOrtega DavidGOrtega added p1-important High priority cml-runner Subcommand cloud-aws Amazon Web Services cloud-gcp Google Cloud labels Jun 19, 2022
Contributor

@dacbd dacbd left a comment

* revert this

* log warning verbiage

* revert this

* move acpiSock to runLocal

* connect info log

* Revert "revert this"

This reverts commit f4c0c04.

* Revert "revert this"

This reverts commit 8a77fae.

* typo
@dacbd dacbd self-requested a review July 1, 2022 17:51
Contributor

@dacbd dacbd left a comment

Good to go to master

Member

@0x2b3bfa0 0x2b3bfa0 left a comment

The systemd service wrapping cml runner suffices to send a TERM signal to the process when the instance transitions to the ACPI G2 state. I believe that this pull request solves an XY problem, unless proven otherwise.

I'm not opposed to merging this pull request if it solves a critical issue, but I would suggest, with a lower priority:

  1. Reading "Discuss about the inclusion of SpotNotifier" #663 (comment) for more information on alternative approaches, like polling for spot instance termination signals through the internal metadata service (a rough polling sketch follows this list).
  2. Making sure that cml runner is able to handle a TERM signal gracefully and doesn't exit until all the cleanup routines have run.
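
As a rough illustration of the alternative in point 1 (a sketch, not an endorsement of a specific implementation), polling AWS's instance metadata service for a spot interruption notice could look like this; the endpoint returns 404 until a notice is issued, and IMDSv2 would additionally require a session token:

```js
// Sketch: poll the AWS instance metadata service for a spot
// interruption notice (IMDSv1 shown for brevity).
const http = require('http');

const SPOT_ACTION_URL =
  'http://169.254.169.254/latest/meta-data/spot/instance-action';

function pollSpotTermination() {
  http
    .get(SPOT_ACTION_URL, (res) => {
      if (res.statusCode === 200) {
        // AWS issues the notice roughly two minutes before termination
        console.log('Spot termination notice received, cleaning up...');
        process.kill(process.pid, 'SIGTERM'); // reuse the plain SIGTERM path
      }
      res.resume(); // drain and discard the response body
    })
    .on('error', () => {
      /* not on AWS, or the metadata service is unreachable; ignore */
    });
}

setInterval(pollSpotTermination, 5000);
```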

To my mind, cml runner should not include any logic to deal with termination notices other than a plain SIGTERM. All the fancy platform-specific features like shutdown and spot instance eviction detection should be part of the startup script on our Terraform provider, which should translate those signals into the aforementioned SIGTERM.
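
For reference, the systemd behavior described above relies on the default stop semantics: on shutdown, systemd stops the unit and delivers KillSignal (SIGTERM by default), escalating to SIGKILL only after TimeoutStopSec. A hypothetical unit along these lines (paths and flags are illustrative, not the Terraform provider's actual startup script):

```ini
# Hypothetical sketch of a unit wrapping cml runner. On stop/shutdown,
# systemd sends SIGTERM (the default KillSignal) and waits up to
# TimeoutStopSec before escalating to SIGKILL.
[Unit]
Description=CML runner
After=network.target

[Service]
ExecStart=/usr/bin/cml runner --name my-runner
KillSignal=SIGTERM
TimeoutStopSec=300

[Install]
WantedBy=multi-user.target
```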

@DavidGOrtega
Contributor Author

DavidGOrtega commented Jul 1, 2022

The systemd service wrapping cml runner suffices to send a TERM signal to the process when the instance transitions to the ACPI G2 state. I believe that this pull request solves an XY problem, unless proven otherwise.

I'm not opposed to merging this pull request if it solves a critical issue, but I would suggest, with a lower priority:

  1. Reading #663 (comment) for more information on alternative approaches, like polling for spot instance termination signals through the internal metadata service.
  2. Making sure that cml runner is able to handle a TERM signal gracefully and doesn't exit until all the cleanup routines have run.

This PR not only fixes spot termination; it also fixes the error of job restarts setting the ids, and the recent bug of restarting already-finished jobs.
And of course you can test that this PR behaves as expected by stopping a normal runner in AWS or GCP and seeing that in many cases it does not unregister, while listening to acpid always gets it right.

@DavidGOrtega
Contributor Author

DavidGOrtega commented Jul 1, 2022

should not include any logic to deal with termination notices other than a plain SIGTERM

I need to check why SIGTERM suffices for TPI. The runner's SIGTERM handler (which behaves perfectly) is not always triggered?!
Quick test: start a runner in GH/GCP and stop it. It won't unregister.
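
For context, point 2 of the earlier suggestion (handling SIGTERM gracefully and not exiting before cleanup finishes) would look roughly like this in Node.js; `unregisterRunner` is a hypothetical stand-in for whatever cleanup routine cml runner actually uses:

```js
// Sketch: defer exit until async cleanup (e.g. unregistering the
// runner from the forge) has finished. `unregisterRunner` is a
// hypothetical placeholder, not cml's actual API.
let shuttingDown = false;

async function unregisterRunner() {
  // call the forge API to remove the runner registration, etc.
}

process.on('SIGTERM', async () => {
  if (shuttingDown) return; // ignore repeated signals
  shuttingDown = true;
  try {
    await unregisterRunner();
  } finally {
    process.exit(0);
  }
});
```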

@0x2b3bfa0
Member

0x2b3bfa0 commented Jul 2, 2022

Thanks! I'll try to reproduce. 🙏🏼

@DavidGOrtega DavidGOrtega marked this pull request as ready for review July 4, 2022 12:19
@DavidGOrtega DavidGOrtega merged commit 6cc6b9f into master Jul 4, 2022
@DavidGOrtega DavidGOrtega deleted the runner/detect-terminate branch July 4, 2022 12:31
@0x2b3bfa0
Member

Unable to reproduce on either AWS or GCP; tried 3 batches of 30 runners on the former, and 10 runners on the latter.

@dacbd
Contributor

dacbd commented Jul 6, 2022

I can say that I was able to see the bad detection behavior once, pre-PR, on GCP, but I also saw the correct behavior several times, and likewise struggled to reproduce it. I think that #1016, or operating the runner process as a separate service that cml.service monitors, is a better solution.

@0x2b3bfa0
Member

I was able to see the bad detection behavior

Are you sure it was caused by the cml process not getting a SIGTERM, or could it be an issue in Node.js or in our code handling it?

@DavidGOrtega
Contributor Author

DavidGOrtega commented Jul 7, 2022

I can say that I was able to see the bad detection behavior once pre-pr on GCP, but I saw also the correct behavior several times, and also struggled to reproduce it.

This coincides with my experience, and I have been saying this all along... systemd's SIGTERM is not consistent in my experience (and others'). No idea why, because it should be.

@dacbd
Contributor

dacbd commented Jul 7, 2022

I was able to see the bad detection behavior

Are you sure it was caused by the cml process not getting a SIGTERM, or could it be an issue in Node.js or in our code handling it?

Both were from an organic spot termination and happened when I was not monitoring the instance, so I have no idea...

Labels: cloud-aws Amazon Web Services · cloud-az Microsoft Azure · cloud-gcp Google Cloud · cml-runner Subcommand · external-request You asked, we did · p0-critical Max priority (ASAP)