[ML] _all requests can suffer "job not found" errors #37959

Closed
@droberts195

Description

(Migrated from #37545 (comment) to improve visibility.)

The failure of https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/166/ showed that a request to perform an ML operation on _all can return an error saying that it could not find an entity it expected to find.

For example, closing _all jobs might return an error that job foo does not exist. Or stopping _all datafeeds might return an error that datafeed bar does not exist.

This seems completely crazy, as it's obvious that _all should only include entities that exist.

The reason this can happen is that our actions involve multiple base level Elasticsearch actions chained together, and entities could be deleted in between these base level steps. For example:

  1. Alice requests force delete of job foo
  2. Bob requests close _all jobs
  3. Bob's request to close _all jobs expands _all to foo and bar
  4. Alice's request to force delete foo removes the config associated with job foo
  5. Bob's request to close _all jobs attempts to find the config for job foo
  6. Bob's request to close _all fails because the config for job foo does not exist
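The interleaving above can be reproduced with a small, deterministic model. This is only an illustrative sketch, not Elasticsearch code: the `configs` dictionary stands in for the stored job configs, and `expand_all`, `close_job` and `ResourceNotFound` are hypothetical names invented for the example.

```python
class ResourceNotFound(Exception):
    """Hypothetical stand-in for the "job not found" error."""


# Hypothetical in-memory stand-in for the stored job configs.
configs = {"foo": {"state": "opened"}, "bar": {"state": "opened"}}


def expand_all():
    # Step 3: snapshot the ids of the jobs that exist right now.
    return sorted(configs)


def close_job(job_id):
    # Steps 5-6: look the config up again; it may have been deleted.
    if job_id not in configs:
        raise ResourceNotFound(f"job {job_id} does not exist")
    configs[job_id]["state"] = "closed"


expanded = expand_all()  # Bob's request expands _all to bar and foo
del configs["foo"]       # Alice's force delete removes foo's config

error = None
try:
    for job_id in expanded:
        close_job(job_id)
except ResourceNotFound as e:
    error = str(e)

print(error)  # job foo does not exist
```

The expansion and the per-job lookup are two separate steps, so anything that deletes a config between them makes the second step fail for an id that the first step legitimately returned.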

Although the test failure that highlighted this problem was a 6.5 test run, I suspect the problem is worse in 6.6 and above, because expanding _all there requires a search for configs in an index rather than just a lookup in the cluster state, which is held in memory on all nodes.

ML actions that operate on _all should silently ignore failures to find entities from the original expansion of _all, on the assumption that these entities have been deleted by a concurrent request.
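One way to picture the proposed behaviour is the sketch below. It is a simplified model under stated assumptions, not the actual fix: `close_jobs`, `close_one`, `ResourceNotFound` and the `after_expansion` hook (used only to simulate a concurrent delete) are all hypothetical names. A missing job is skipped only when its id came from expanding _all; a job requested by name still produces the error.

```python
class ResourceNotFound(Exception):
    """Hypothetical stand-in for the "job not found" error."""


def close_one(configs, job_id):
    # Second step: look the config up again before closing the job.
    if job_id not in configs:
        raise ResourceNotFound(f"job {job_id} does not exist")
    configs[job_id]["state"] = "closed"


def close_jobs(configs, expression, after_expansion=None):
    """Close the jobs matching expression, tolerating races only for _all."""
    from_all = expression == "_all"
    job_ids = sorted(configs) if from_all else [expression]
    if after_expansion is not None:
        after_expansion()  # test hook: simulates a concurrent request
    closed = []
    for job_id in job_ids:
        try:
            close_one(configs, job_id)
        except ResourceNotFound:
            if not from_all:
                raise  # an explicitly named job must still exist
            continue   # assume a concurrent request deleted it
        closed.append(job_id)
    return closed


configs = {"foo": {"state": "opened"}, "bar": {"state": "opened"}}
# foo is deleted after _all has been expanded but before it is closed.
closed = close_jobs(configs, "_all", after_expansion=lambda: configs.pop("foo"))
print(closed)  # ['bar']
```

Keeping the error for explicitly named jobs is a design choice in this sketch: it limits the silent skip to ids the caller never asked for by name.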

Metadata

Labels

:ml (Machine learning), >bug
