Elasticsearch readiness probe might fail if a single node is stuck

Closed

Description

It looks like we run GET _cat/nodes?local in order to check a pod's readiness:

https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/elasticsearch/nodespec/readiness_probe.go#L32

This API call does two things: it reads the list of nodes from the local cluster state, and then it contacts every node in the cluster to collect some extra info from each of them. If a single node fails to respond to that second step within a few seconds, then I think we end up considering all the pods to be failing their readiness checks.

I think it's a bug to accept ?local on this API at all given this gotcha (see elastic/elasticsearch#50088), but I also think we should be using a different API for our readiness checks. For instance, GET _cluster/health?timeout=0s returns 200 OK iff the node is part of a cluster with a master, and GET / returns (something) if the node is actually alive and responding.

(I say "something" because in certain versions GET / returns 503 Service Unavailable if there is no master, but more recent versions always return 200 OK here.)


Labels: >bug, v1.0.0