Elasticsearch readiness probe might fail if a single node is stuck

It looks like we run `GET _cat/nodes?local` in order to check a pod's readiness:

https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/elasticsearch/nodespec/readiness_probe.go#L32

This API call does two things: it gets the list of nodes from the local cluster state, and then it reaches out to every node in the cluster to obtain some extra info from each of them. If a single node in the cluster fails to respond to this second step within a few seconds, then I think we end up considering all the pods to be failing their readiness checks.

I think it's a bug to accept `?local` on this API at all given this gotcha (see https://github.com/elastic/elasticsearch/issues/50088), but I also think we should be using a different API for our readiness checks. For instance, `GET _cluster/health?timeout=0s` returns `200 OK` iff the node is part of a cluster with a master, and `GET /` returns (something) if the node is actually alive and responding.

(I say "something" because in certain versions `GET /` returns `503 Service Unavailable` if there is no master, but more recent versions always return `200 OK` here.)
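
To make the suggestion concrete, here is a rough sketch of a probe built on `GET _cluster/health?timeout=0s`. It is not what the operator does today and not necessarily how it should be implemented there; the function name `is_ready`, the `base_url` parameter, and the 3-second client timeout are assumptions for illustration only, and a real probe would also need the cluster's TLS and auth settings.

```python
import http.client
import urllib.error
import urllib.request

# Talk to the local node directly; ignore any proxy settings in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_ready(base_url: str, request_timeout_s: float = 3.0) -> bool:
    """Hypothetical readiness check: True iff the health API answers 200 OK.

    timeout=0s asks the node to answer from what it knows right now
    instead of waiting for the cluster to reach some state.
    """
    url = base_url.rstrip("/") + "/_cluster/health?timeout=0s"
    try:
        with _OPENER.open(url, timeout=request_timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Non-2xx responses raise HTTPError (a URLError subclass);
        # refused connections and socket timeouts also end up here.
        return False
```

Calling `is_ready("http://localhost:9200")` only touches the local node, so one stuck node elsewhere in the cluster should not make every pod look unready.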

Issue #2248, opened by DaveCTurner on Dec 11, 2019
