Description
Today we quietly set a default of 30s for MasterNodeRequest#masterNodeTimeout
, relying on callers to override it where appropriate. However it is essentially always appropriate to override this timeout in production code. Forgetting to do so leads to subtle but critical bugs that only arise when a cluster is struggling to process cluster state updates as fast as normal, in which there is no way to lengthen these timeouts at runtime to bring the cluster back to health. There are several such bugs in Elasticsearch today (e.g. #107857, but I've found others too).
Instead in production code we should move away from having any quiet default for this timeout and instead require callers to specify an appropriate value when creating the request instance in the first place. For example:
- Instances that relate to a REST request should always set this timeout according to the
?master_timeout
request parameter, whose default ofRestUtils#REST_MASTER_TIMEOUT_DEFAULT
(30s) is explicitly documented and conceptually separate fromMasterNodeRequest#DEFAULT_MASTER_NODE_TIMEOUT
(also coincidentally 30s). - Instances that relate to internal activities should probably set this timeout very long since it's normally better to wait patiently instead of failing sooner, especially if the failure simply triggers a retry. Alternatively, they could expose the relevant timeout using a setting.
- In test code it doesn't matter so much, we can default to 30s unless the test specifically needs a different value, but there is no particular need to avoid passing this value on request construction either.
It's not a trivial change since there are currently over 200 implementations of MasterNodeRequest
, all of which currently rely on the default behaviour, so it's going to take a few steps to complete this work. Therefore I'm opening this meta-issue to track the work needed in this area.