Remove `MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT`

Today we quietly set a default of 30s for `MasterNodeRequest#masterNodeTimeout`, relying on callers to override it where appropriate. However it is essentially _always_ appropriate to override this timeout in production code. Forgetting to do so leads to subtle but critical bugs that only arise when a cluster is struggling to process cluster state updates as fast as normal, in which there is no way to lengthen these timeouts at runtime to bring the cluster back to health. There are several such bugs in Elasticsearch today (e.g. #107857, but I've found others too).

Instead in production code we should move away from having any quiet default for this timeout and instead require callers to specify an appropriate value when creating the request instance in the first place. For example:

- Instances that relate to a REST request should always set this timeout according to the `?master_timeout` request parameter, whose default of `RestUtils#REST_MASTER_TIMEOUT_DEFAULT` (30s) is explicitly documented and conceptually separate from `MasterNodeRequest#DEFAULT_MASTER_NODE_TIMEOUT` (also coincidentally 30s).
- Instances that relate to internal activities should probably set this timeout very long since it's normally better to wait patiently instead of failing sooner, especially if the failure simply triggers a retry. Alternatively, they could expose the relevant timeout using a setting.
- In test code it doesn't matter so much, we can default to 30s unless the test specifically needs a different value, but there is no particular need to avoid passing this value on request construction either.

It's not a trivial change since there are currently over 200 implementations of `MasterNodeRequest`, all of which currently rely on the default behaviour, so it's going to take a few steps to complete this work. Therefore I'm opening this meta-issue to track the work needed in this area.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove `MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT` #107984

DaveCTurner
openedon Apr 28, 2024

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Remove MasterNodeRequest#TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT #107984

Description

DaveCTurneropenedon Apr 28, 2024

Metadata