Cloud runner terminate improvements #653
Conversation
DavidGOrtega commented on Jul 11, 2021 (edited)
- Introduces waitCompletedPipelineJobs, a function that waits for the expected jobs to complete. Fixes "Instances intermittently fail to terminate" #649
- Adds a handler for the process exit event. Fixes "Single accepts only one job but does not cleanup" #662. Closes "cml-runner - Better idle mechanism in GH" #407
- After testing, closes "GCP compute instances are not shutdown after idle timeout" #661
- Adds SpotNotifier (see the sketch below). Closes "CML seemingly fails to restart job after AWS Spot instances have been shut down" #650
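For context, a minimal sketch of how the exit-event cleanup and a spot-termination notice could feed into a single shutdown path. This is illustrative only: shutdown() and unregisterRunner() are hypothetical placeholder names, and the actual ec2-spot-notification API is not reproduced here.

// Illustrative sketch only; not the actual CML implementation.
let disposed = false;

// Hypothetical cleanup: whatever the runner must do before the machine
// goes away (for example, unregister itself from GitHub or GitLab).
async function unregisterRunner() {
  console.error('Unregistering runner (placeholder)');
}

async function shutdown(reason) {
  if (disposed) return; // run the cleanup at most once
  disposed = true;
  console.error(`Shutting down runner: ${reason}`);
  await unregisterRunner();
  process.exit(0);
}

// Clean up when the process is asked to stop or has finished its single job.
process.on('SIGTERM', () => shutdown('SIGTERM received'));
process.on('SIGINT', () => shutdown('SIGINT received'));
process.on('exit', () => {
  // 'exit' handlers must be synchronous, so only log here.
  console.error('Runner process exiting');
});

// A SpotNotifier-style watcher would call the same shutdown() path when the
// cloud provider announces that the spot instance is about to be reclaimed.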
@@ -5,6 +5,7 @@ const { homedir } = require('os');
 
 const fs = require('fs').promises;
 const yargs = require('yargs');
+const { SpotNotifier } = require('ec2-spot-notification');
Can we separate this functionality and move it to the AMI and fallback on the terraform-provider-iterative repository? While this is a must have, keeping it as part of the CML code would introduce another unrelated responsibility to this codebase.
We can just curl the 169.254.169.254 endpoint in a loop and run systemctl stop cml once we get a termination notice, all from the instance script.
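For illustration, the same check sketched in Node.js (the proposal above would use a plain curl loop in the instance script; this assumes IMDSv1-style token-less access and the standard AWS spot instance-action metadata path):

// Sketch of the polling approach described above, written in Node.js for
// illustration only. Assumes IMDSv1 (token-less) metadata access.
const http = require('http');
const { execSync } = require('child_process');

const INSTANCE_ACTION_URL =
  'http://169.254.169.254/latest/meta-data/spot/instance-action';

function checkForTerminationNotice() {
  const request = http.get(INSTANCE_ACTION_URL, (response) => {
    response.resume(); // drain the body; only the status code matters here
    if (response.statusCode === 200) {
      // 200 means AWS has scheduled this spot instance for interruption;
      // stop the runner service gracefully, as suggested above.
      execSync('systemctl stop cml');
    }
    // 404 means there is no interruption notice yet; keep polling.
  });
  request.on('error', () => {
    // Ignore transient metadata errors and retry on the next interval.
  });
}

// AWS gives roughly a two-minute warning, so a short interval is enough.
setInterval(checkForTerminationNotice, 5000);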
Let's open another ticket and, for now, use the work that is already done and working.
Also, we might only want this behaviour with the runner, so I guess it should go in the TPI runner resource at most.
> Can we separate this functionality and move it to the AMI and fallback on the terraform-provider-iterative repository? While this is a must have, keeping it as part of the CML code would introduce another unrelated responsibility to this codebase.

Actually, I disagree. It's part of the logic that we are putting in place to handle long-running jobs, the same as what we do with the 72 hours, for example, and it ensures the integrity and responsibility of the runner.
> I'm thinking of doing it all in JS executed with V8. In the end we are using the vendors' SDKs, which already exist in JS.

Not sure about the specific advantages of V8 over any other engine for this use case, but anyhow. This is a completely separate discussion topic.

> - More complex in the TPI

Not that complex.

> - More obscure having another service or whatever...

Not as obscure as moving cloud-specific code to the CML repository instead of abstracting over signals, in my opinion.

> - This runner could be launched manually inside an AWS instance and still work on its own

Do we really want to support this use case? What's the advantage of supporting it if we don't do the same for the other providers?
They are all advantages, even though you disagree.
> Not as obscure as moving cloud-specific code to the CML repository

The CML repository is also a runner, one that can check whatever is needed to ensure the feature it offers: transparent spot instances.
> Not that complex.

We also have to check noRetry, pass it around, create another service or whatever... business logic broken into multiple parts... Ugly!
> Not sure about the specific advantages of V8 over any other engine for this use case, but anyhow. This is a completely separate discussion topic.

I'm proposing a solution to avoid Golang source code, and with it I'm also planning to unify the base source code that we were discussing.
@0x2b3bfa0 are you approving it? We have several P0s that are fixed here. I want to release it ASAP.
I'm hesitant to approve the inclusion of SpotNotifier in the CML code base, especially without a follow-up technical debt / discussion issue. I'm going to run tests on Google Cloud to confirm that #661 gets fixed and check everything for approval, but I'd prefer to discuss this in the weekly meeting so we can make better decisions.
Open a ticket to follow up. I agree with this approach and I won't agree with the opposite, so... To me there is no technical debt here.
I have tested it. The machines are disposed of.
Looks good to me, except for the inclusion of SpotNotifier. Approving just because of #653 (comment) to avoid blocking priority 0 fixes.
@DavidGOrtega @0x2b3bfa0 thanks so much for this!
Reporter of #661 checking in. Indeed, thanks for jumping on this! Unnnnnfortunately, I'm afraid that I'm still seeing the same behavior as before. (I'm assuming that the iterative/setup-cml@v1 action uses the latest cml release.) In 5 out of 5 separate runs, my Google CE instances stayed alive rather than being shut down. Seeing as your tests passed, though, maybe it's user error on my end?
@lemontheme I would need to see your workflow. Tests with TPI indicate that the instances are disposed of after the expected time.
@DavidGOrtega sorry for the late reply. I tested with the same workflow as in #661.
@lemontheme, can you please open a new issue?