Description
Currently, if an ML job fails, we expect the datafeed to stop when it next tries to send data to that job and receives an exception response. However, this may not happen until some time after the job has failed, because the datafeed only sends data to it periodically.
If the node that the datafeed is running on is removed from the cluster while the job is in the failed state then there is a problem: the datafeed node assignment code refuses to assign the datafeed while its job is failed. This leaves the datafeed in limbo, unable to be assigned to a new node and (due to #48931) unable to be force-stopped.
It would make more sense if the datafeed node assignment cancelled datafeeds whose jobs are in states where the datafeed would not work after reassignment, such as `closing` or `failed`.
There may be some subtlety here, as the same node assignment code is used to generate error messages for the start_datafeed endpoint. So we may need a way of telling the node assignment code whether it is being called for an initial assignment or a reassignment, and have it behave differently in the two cases:
- For an initial assignment return an error message if the datafeed's job is in an unacceptable state
- For a reassignment log an error and cancel the datafeed task if the datafeed's job is in an unacceptable state
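The two-mode behaviour above could be sketched roughly as follows. This is an illustrative sketch only: `JobState`, `AssignmentMode`, `Decision`, and `decide` are hypothetical names for this example, not the actual Elasticsearch ML classes or methods.

```java
import java.util.EnumSet;
import java.util.Set;

public class DatafeedAssignmentSketch {

    // Hypothetical simplification of ML job lifecycle states
    enum JobState { OPENING, OPENED, CLOSING, CLOSED, FAILED }

    // Whether the assignment check runs for start_datafeed or a reassignment
    enum AssignmentMode { INITIAL, REASSIGNMENT }

    // Possible outcomes of the assignment check
    enum Decision { ASSIGN, RETURN_ERROR, CANCEL_DATAFEED }

    // Job states in which a datafeed cannot usefully run
    private static final Set<JobState> UNACCEPTABLE =
        EnumSet.of(JobState.CLOSING, JobState.CLOSED, JobState.FAILED);

    /**
     * Decides what to do with a datafeed whose job is in the given state.
     * For an acceptable state, assignment proceeds in both modes. For an
     * unacceptable state, an initial assignment fails fast with an error
     * (surfaced via start_datafeed), while a reassignment logs an error
     * and cancels the datafeed task rather than leaving it in limbo.
     */
    static Decision decide(JobState jobState, AssignmentMode mode) {
        if (UNACCEPTABLE.contains(jobState) == false) {
            return Decision.ASSIGN;
        }
        return mode == AssignmentMode.INITIAL
            ? Decision.RETURN_ERROR
            : Decision.CANCEL_DATAFEED;
    }

    public static void main(String[] args) {
        System.out.println(decide(JobState.OPENED, AssignmentMode.INITIAL));
        System.out.println(decide(JobState.FAILED, AssignmentMode.INITIAL));
        System.out.println(decide(JobState.FAILED, AssignmentMode.REASSIGNMENT));
    }
}
```

The key design point is that the caller, not the state check itself, supplies the mode, so the same acceptability logic serves both the start_datafeed error path and the reassignment cancellation path without duplication.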