Description
(Migrated from #37545 (comment) to improve visibility.)
The failure of https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/166/ showed that it is possible for a request to do some ML operation for `_all` to return an error that it could not find an entity it expected to find.
For example, closing `_all` jobs might return an error that job `foo` does not exist. Or stopping `_all` datafeeds might return an error that datafeed `bar` does not exist.
This seems completely crazy, as it's obvious that `_all` should only include entities that exist.
The reason this can happen is that our actions involve multiple base level Elasticsearch actions chained together, and entities could be deleted in between these base level steps. For example:
- Alice requests force delete of job `foo`
- Bob requests close `_all` jobs
- Bob's request to close `_all` jobs expands `_all` to `foo` and `bar`
- Alice's request to force delete `foo` removes the config associated with job `foo`
- Bob's request to close `_all` jobs attempts to find the config for job `foo`
- Bob's request to close `_all` fails because the config for job `foo` does not exist
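
The interleaving above can be replayed deterministically. The following is only a minimal Python sketch, not Elasticsearch code: the in-memory `configs` dict, the `ResourceNotFound` exception, and the helpers `expand_all`, `close_job` and `force_delete` are all hypothetical stand-ins for the chained base-level actions.

```python
class ResourceNotFound(Exception):
    """Hypothetical stand-in for a 'job does not exist' error."""


# Hypothetical in-memory config store standing in for wherever job configs live.
configs = {"foo": {"state": "opened"}, "bar": {"state": "opened"}}


def expand_all():
    # First step of close: resolve _all to the job IDs that exist right now.
    return sorted(configs)


def close_job(job_id):
    # Second step of close: look the config up again; this fails if it was deleted meanwhile.
    if job_id not in configs:
        raise ResourceNotFound(f"job [{job_id}] does not exist")
    configs[job_id]["state"] = "closed"


def force_delete(job_id):
    # Alice's action: remove the config for the job.
    configs.pop(job_id, None)


expanded = expand_all()   # Bob: _all expands to bar and foo
force_delete("foo")       # Alice: config for foo removed between Bob's two steps

error = None
try:
    for job_id in expanded:   # Bob: continues with the now stale expansion
        close_job(job_id)
except ResourceNotFound as e:
    error = str(e)

print(error)
```

In this sketch Bob's request fails on `foo` even though `foo` came from his own expansion of `_all`.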
Although the test failure that highlighted this problem was a 6.5 test run, I suspect the problem is worse in 6.6 and above because expanding `_all` requires a search for configs in an index rather than just looking in the (in-memory on all nodes) cluster state.
ML actions that operate on `_all` should silently ignore failures to find entities from the original expansion of `_all`, on the assumption that these entities have been deleted by a concurrent request.
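
One way to picture the proposed behaviour is the following minimal Python sketch. It is an illustration under assumptions, not the actual implementation: `close_expanded`, the `from_wildcard` flag, the `configs` dict and `ResourceNotFound` are hypothetical names. A missing entity is skipped only when its ID came from an expansion of `_all`; an explicitly named entity that is missing still produces an error.

```python
class ResourceNotFound(Exception):
    """Hypothetical stand-in for a 'job does not exist' error."""


def close_expanded(job_ids, configs, from_wildcard):
    """Close each job in job_ids; tolerate missing configs only for _all expansions."""
    closed, skipped = [], []
    for job_id in job_ids:
        config = configs.get(job_id)
        if config is None:
            if from_wildcard:
                # Assume a concurrent request deleted the job after _all was expanded.
                skipped.append(job_id)
                continue
            raise ResourceNotFound(f"job [{job_id}] does not exist")
        config["state"] = "closed"
        closed.append(job_id)
    return closed, skipped


# foo was in the original expansion of _all but has since been force deleted.
configs = {"bar": {"state": "opened"}}
closed, skipped = close_expanded(["bar", "foo"], configs, from_wildcard=True)
print(closed, skipped)
```

Keeping the strict error for explicitly named jobs means a request such as closing `foo` by name would still report that `foo` does not exist.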