Description
Elasticsearch Version
master (and more)
Installed Plugins
N/A
Java Version
bundled
OS Version
Cloud
Problem Description
Today `TransportPutShutdownNodeAction` and `TransportDeleteShutdownNodeAction` submit their cluster state update tasks at `NORMAL` priority, which may mean that they take a long time to execute in an overloaded cluster. They also use the (now-forbidden) unbatched executor, and always generate a new cluster state update even if the change is a no-op. I think we should:
- Implement a batching executor, since the orchestrator will sometimes retry on a timeout. (Use batched executor for shutdown node actions #86018)
- Increase the priority to `URGENT`. These tasks look to be pretty lightweight, and it's important for orchestration that they run ASAP. (Use urgent priority for node shutdown cluster state update #85838)
- Complete the listener without waiting for the follow-up reroute, so that we can ack that we processed the shutdown request quickly. I don't think real clients need to wait for the first reroute after the shutdown request was published; test clients might but we have other ways to wait for this in tests. (Ack node shutdown requests after cluster state update is complete #85846)
- Avoid publishing a cluster state if the update is a no-op. Can we even avoid submitting the task if it looks like we already marked the node for shutdown? (Add noop detection to node shutdown actions #85914)
- Bypass circuit breakers when manipulating shutdown metadata (in the REST and Transport layers). (Avoid circuit breaker trips in shutdown node actions #86047)
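
To make the batching and no-op ideas concrete, here is a minimal, self-contained sketch. It is not Elasticsearch code: `ShutdownBatchSketch`, `State`, `Task` and `executeBatch` are hypothetical names invented for illustration, and the real cluster state, executor and listener types are deliberately left out. It only shows the shape I have in mind: apply every queued task in one pass, and hand back the same state instance when nothing changed so that the caller can skip publication.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model. None of these types are Elasticsearch classes.
public class ShutdownBatchSketch {

    /** Immutable stand-in for the shutdown metadata: node id -> shutdown type. */
    public static final class State {
        public final Map<String, String> shutdowns;

        public State(Map<String, String> shutdowns) {
            this.shutdowns = Collections.unmodifiableMap(new HashMap<>(shutdowns));
        }
    }

    /** One queued request. A null type means "remove the shutdown marker". */
    public static final class Task {
        public final String nodeId;
        public final String type;

        public Task(String nodeId, String type) {
            this.nodeId = nodeId;
            this.type = type;
        }
    }

    /**
     * Applies a whole batch of tasks in one pass. If the resulting metadata
     * equals the current metadata, the same State instance is returned, which
     * the caller can treat as "nothing to publish".
     */
    public static State executeBatch(State current, List<Task> tasks) {
        Map<String, String> updated = new HashMap<>(current.shutdowns);
        for (Task task : tasks) {
            if (task.type == null) {
                updated.remove(task.nodeId);
            } else {
                updated.put(task.nodeId, task.type);
            }
        }
        return updated.equals(current.shutdowns) ? current : new State(updated);
    }

    public static void main(String[] args) {
        State s0 = new State(new HashMap<>());

        // Two retries of the same request land in one batch and yield one new state.
        State s1 = executeBatch(s0, Arrays.asList(
            new Task("node-1", "RESTART"),
            new Task("node-1", "RESTART")));
        System.out.println("first batch changed state: " + (s1 != s0));

        // A later retry is a no-op, so the same instance comes back.
        State s2 = executeBatch(s1, Arrays.asList(new Task("node-1", "RESTART")));
        System.out.println("retry was a no-op: " + (s2 == s1));
    }
}
```

Returning the identical instance for a no-op is an assumption of this sketch; it is simply a cheap way for the caller to detect that nothing needs publishing. Deleting a marker that does not exist falls out as a no-op in the same way.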