-
Notifications
You must be signed in to change notification settings - Fork 341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Runner detect ACPI shutdown #1068
Conversation
…-no-special-cases
…-no-special-cases
…-no-special-cases
…-no-special-cases
Co-authored-by: Daniel Barnes <dabarnes2b@gmail.com>
…e/cml into runner-no-special-cases
…-no-special-cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to go to master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The systemd service wrapping cml runner
suffices to send a TERM signal to the process when the instance transitions to the ACPI G2 state. I believe that this pull request solves an XY problem, unless proven otherwise.
I'm not opposed to merging this pull request if it solves a critical issue, but I would suggest, with a lower priority:
- Reading Discuss about the inclusion of SpotNotifier #663 (comment) for more information on alternative approaches, like polling for spot instance termination signals through the internal metadata service.
- Making sure that
cml runner
is able to handle gracefully a TERM signal and doesn't exit until all the cleanup routines run.
To my mind, cml runner
should not include any logic to deal with termination notices other than a plain SIGTERM. All the fancy platform-specific features like shutdown and spot instance eviction detection should be part of the startup script on our Terraform provider, which should translate those signals into the aforementioned SIGTERM.
This PR not only fixes the spot termination, indeed fixes the error of job restart setting the ids, and the recent bug of restarting already finished jobs |
I need to check why TPI suffices with SIGTERM. The runner with SIGTERM (behaving perfectly) is not always triggered?! |
Thanks! I'll try to reproduce. 🙏🏼 |
Unable to reproduce on AWS nor GCP; tried 3 batches of 30 runners on the former, and 10 runners on the latter. |
I can say that I was able to see the bad detection behavior once pre-pr on GCP, but I saw also the correct behavior several times, and also struggled to reproduce it. I see think that: #1016 or operating the runner process as a separate service that |
Are you sure that it was caused because the |
This coincides with my experience and I have been always saying this... |
both were from an organic spot termination and happened when I was not monitoring the instance and thus have no idea... |
In GCP and AWS spot instances are terminated via ACPI shutdown. This PR also fixes the resource termination stopping the runner from the cloud console
acpi
in TPIrunner
restartWorkflow is not working #1072pipelineRerun
andpipelineRestart
removing the later