Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud runner terminate improvements #653

Merged
merged 12 commits into from
Jul 20, 2021
Merged
2 changes: 1 addition & 1 deletion .github/workflows/test-deploy.yml
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ jobs:
GITHUB_TOKEN: ${{ github.token }}
TEST_GITHUB_TOKEN: ${{ secrets.TEST_GITHUB_TOKEN }}
TEST_GITHUB_REPO: https://github.com/iterative/cml_qa_tests_dummy
TEST_GITHUB_SHA: 62edc8b3f46a60b3fe1e5c08fd3e0046d350ee29
TEST_GITHUB_SHA: 0cd16da26e35f8e5d57b2549a97e22618abf08f6
TEST_GITLAB_TOKEN: ${{ secrets.TEST_GITLAB_TOKEN }}
TEST_GITLAB_REPO: https://gitlab.com/iterative.ai/cml_qa_tests_dummy
TEST_GITLAB_SHA: c4c13286e78dc252dd2611f31a755f10d343fbd4
Expand Down
50 changes: 41 additions & 9 deletions bin/cml-runner.js
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ const { homedir } = require('os');

const fs = require('fs').promises;
const yargs = require('yargs');
const { SpotNotifier } = require('ec2-spot-notification');
Copy link
Member

@0x2b3bfa0 0x2b3bfa0 Jul 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we separate this functionality and move it to the AMI and fallback on the terraform-provider-iterative repository? While this is a must have, keeping it as part of the CML code would introduce another unrelated responsibility to this codebase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just curl the 169.254.169.254 endpoint in a loop and run systemctl stop cml once we get a termination notice, all from the instance script.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets open another ticket and use for now the already done work that also works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we might just want this behaviour with the runner, so I guess it should go in the TPI runner resource at most

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we separate this functionality and move it to the AMI and fallback on the terraform-provider-iterative repository? While this is a must have, keeping it as part of the CML code would introduce another unrelated responsibility to this codebase.

Actually I disagree, Its a part of the logic that we are setting to handle long job runners. The same that we do with the 72 hours in example. And ensures the integrty and responsability of the runner

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im thinking to do all in JS executed with V8. At the very end we are using the vendors SDK that already exists in JS

Not sure about the specific advantages of V8 over any other engine for this use case, but anyhow. This is a completely separate discussion topic.

  • More complex in the TPI

Not that complex.

  • More obscure having another service or whatever...

Not as obscure as moving cloud-specific code to the CML repository instead of abstracting over signals, in my opinion.

  • This runner could be launched manually inside an AWS instance and still works on its own

Do we really want to support this use case? What's the advantage of supporting it if we don't do the same for the other providers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All advantages despite that you disagree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as obscure as moving cloud-specific code to the CML repository

CML repository is a runner also that can check whatever is needed to ensure the feature it offers. Transparent spot instances

Copy link
Contributor Author

@DavidGOrtega DavidGOrtega Jul 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that complex.

We have to check also the noRetry... pass it.. create another service or whatever, business logic broken in multiple parts ... Ugly!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the specific advantages of V8 over any other engine for this use case, but anyhow. This is a completely separate discussion topic.

Im telling you a solution to avoid Golang source code that Im planning to also unify the base source code that we were discussing


const { exec, randid, sleep } = require('../src/utils');
const tf = require('../src/terraform');
Expand Down Expand Up @@ -234,12 +235,34 @@ const runLocal = async (opts) => {
} else if (log && log.status === 'job_ended') {
const { job } = log;

const waitCompletedPipelineJobs = () => {
return new Promise((resolve, reject) => {
try {
if (RUNNER_JOBS_RUNNING.length === 1) {
resolve([RUNNER_JOBS_RUNNING[0].id]);
return;
}

const watcher = setInterval(async () => {
const jobs = (
await cml.pipelineJobs({ jobs: RUNNER_JOBS_RUNNING })
0x2b3bfa0 marked this conversation as resolved.
Show resolved Hide resolved
)
.filter((job) => job.status === 'completed')
.map((job) => job.id);

if (jobs.length) {
resolve(jobs);
clearInterval(watcher);
}
}, 5 * 1000);
} catch (err) {
reject(err);
}
});
};

if (!RUNNER_SHUTTING_DOWN) {
const jobs = job
? [job]
: (await cml.pipelineJobs({ jobs: RUNNER_JOBS_RUNNING }))
.filter((job) => job.status === 'completed')
.map((job) => job.id);
const jobs = job ? [job] : await waitCompletedPipelineJobs();

RUNNER_JOBS_RUNNING = RUNNER_JOBS_RUNNING.filter(
(job) => !jobs.includes(job.id)
Expand All @@ -251,13 +274,22 @@ const runLocal = async (opts) => {
proc.stderr.on('data', dataHandler);
proc.stdout.on('data', dataHandler);
proc.on('uncaughtException', () => shutdown(opts));
proc.on('SIGINT', () => shutdown(opts));
proc.on('SIGTERM', () => shutdown(opts));
proc.on('SIGQUIT', () => shutdown(opts));
proc.on('disconnect', () => shutdown(opts));
proc.on('exit', () => shutdown(opts));

if (!noRetry) {
try {
console.log(`EC2 id ${await SpotNotifier.instanceId()}`);
SpotNotifier.on('termination', () => shutdown(opts));
SpotNotifier.start();
} catch (err) {
console.log('SpotNotifier can not be started.');
}
}

if (parseInt(idleTimeout) !== 0) {
const watcher = setInterval(() => {
RUNNER_TIMEOUT_TIMER >= idleTimeout &&
RUNNER_TIMEOUT_TIMER > idleTimeout &&
shutdown(opts) &&
clearInterval(watcher);

Expand Down
Loading