[ML] Audit job open failures and stop any corresponding datafeed #80665

Merged

Conversation

droberts195 (Contributor)

The anomaly detection code contained an assumption dating back
to 2016 that if a job failed then its datafeed would notice and
stop itself. That works if the job fails on a node after it has
successfully started up. But it doesn't work if the job fails
during the startup sequence. If the job is being started for the
first time then the datafeed won't be running, so there's no
problem, but if the job fails while it is being reassigned to a
new node then the assumption breaks down, because at that
instant the datafeed is started but not assigned to any node.
This leads to a situation where a datafeed is started but its
job is not opened, which is supposed to be impossible.

This PR addresses this by making the job force-stop its own
datafeed if it fails during its startup sequence and the datafeed
is started.

Fixes #48934

Additionally, auditing of job failures during the startup
sequence is moved so that it happens for all failure scenarios
instead of just one.

Fixes #80621
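
In outline, the fix works like the sketch below. This is a
minimal illustration of the description above, not the merged
Elasticsearch code: JobFailureHandler, DatafeedService, and all
of the method names here are invented for the example.

import java.util.Optional;

// Minimal sketch of the behaviour described above. Every name in
// this class is an assumption made for illustration.
class JobFailureHandler {

    // Illustrative stand-in for the real datafeed management code.
    interface DatafeedService {
        // Datafeed id if one is started for this job, empty otherwise.
        Optional<String> startedDatafeedFor(String jobId);

        void forceStop(String datafeedId);
    }

    private final DatafeedService datafeeds;

    JobFailureHandler(DatafeedService datafeeds) {
        this.datafeeds = datafeeds;
    }

    void onStartupFailure(String jobId, Exception failure) {
        // Audit every startup failure scenario, not just one of them.
        audit(jobId, failure);
        // A datafeed that is started but momentarily unassigned (for
        // example while its job is being reassigned) cannot notice the
        // job failure on its own, so the failing job force-stops it.
        datafeeds.startedDatafeedFor(jobId).ifPresent(datafeeds::forceStop);
    }

    private void audit(String jobId, Exception failure) {
        System.err.println("[" + jobId + "] job failed during startup: " + failure.getMessage());
    }
}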

elasticmachine added the Team:ML label (Meta label for the ML team) on Nov 11, 2021
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

handler.accept(e);
}
});
communicator.writeUpdateProcessMessage(updateProcessMessage.build(), (aVoid, e) -> handler.accept(e));
Member:

lulz, accept null if null
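
(The callback above is invoked with a null exception on success,
so the handler is deliberately handed null to signal that the
update succeeded.)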

Comment on lines 329 to 338
jobTask.updatePersistentTaskState(failedState, ActionListener.wrap(r -> {
logger.debug(() -> new ParameterizedMessage("[{}] updated task state to failed", jobTask.getJobId()));
stopAssociatedDatafeedForFailedJob(jobTask.getJobId());
}, e -> {
logger.error(
new ParameterizedMessage("[{}] error while setting task state to failed; marking task as failed", jobTask.getJobId()),
e
);
jobTask.markAsFailed(e);
}));
Member:

We are blowing away the task here with jobTask.markAsFailed(e);

So, does it make sense to also stopAssociatedDatafeedForFailedJob(jobTask.getJobId()); when we failed to set the job to failed?

Otherwise, we have a datafeed task around but the job task is gone.

droberts195 (Contributor, Author):

Yes, sure, I'll add stopAssociatedDatafeedForFailedJob here too.

I am not sure if it will make any difference in practice though, because the only likely reason for failing to update the persistent task status is if the node is shutting down, and if that's happening then stopping the associated datafeed will also fail.
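
For concreteness, the agreed change would make the failure
branch of the hunk above look roughly like this (a sketch
reusing the code shown above; the exact shape in the merged
commit may differ):

jobTask.updatePersistentTaskState(failedState, ActionListener.wrap(r -> {
    logger.debug(() -> new ParameterizedMessage("[{}] updated task state to failed", jobTask.getJobId()));
    stopAssociatedDatafeedForFailedJob(jobTask.getJobId());
}, e -> {
    logger.error(
        new ParameterizedMessage("[{}] error while setting task state to failed; marking task as failed", jobTask.getJobId()),
        e
    );
    // Also stop the datafeed here, so that a started datafeed task is
    // not left behind once markAsFailed removes the job task.
    stopAssociatedDatafeedForFailedJob(jobTask.getJobId());
    jobTask.markAsFailed(e);
}));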

Comment on lines 356 to 367
e -> {
if (autodetectProcessManager.isNodeDying() == false) {
logger.error(
new ParameterizedMessage(
"[{}] failed to stop associated datafeed [{}] after job failure",
jobId,
runningDatafeedId
),
e
);
}
}
Member:

What do you think of auditing a failure to stop?

TransportStopDatafeedAction already audits on stop success. I am just wondering if we should alert the user of a weird state.
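
Concretely, the suggestion would extend the error handler in the
hunk above along these lines. This is a sketch: the auditor
field and its error(jobId, message) call are assumptions
modelled on the auditing that already happens on a successful
stop.

e -> {
    if (autodetectProcessManager.isNodeDying() == false) {
        logger.error(
            new ParameterizedMessage(
                "[{}] failed to stop associated datafeed [{}] after job failure",
                jobId,
                runningDatafeedId
            ),
            e
        );
        // Surface the inconsistent state (failed job, still-started
        // datafeed) to the user via an audit message.
        auditor.error(jobId, "failed to stop associated datafeed ["
            + runningDatafeedId + "] after job failure");
    }
}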

droberts195 merged commit d95bca8 into elastic:master on Nov 11, 2021
droberts195 deleted the fail_datafeed_if_job_open_fails branch on Nov 11, 2021 at 20:39
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Nov 11, 2021
…stic#80665)

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Nov 11, 2021
…stic#80665)

elasticsearchmachine (Collaborator)

💚 Backport successful

Branches: 8.0, 7.16

elasticsearchmachine pushed a commit that referenced this pull request Nov 11, 2021
) (#80677)

elasticsearchmachine pushed a commit that referenced this pull request Nov 12, 2021
…ed (#80665) (#80678)

* [ML] Audit job open failures and stop any corresponding datafeed (#80665)

* Removing 8.0+ test that won't work on 7.16
Labels: >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v7.16.0, v8.0.0-rc1, v8.1.0
6 participants