
Liveness/Readiness probes #65

Closed

rymsha opened this issue Jun 3, 2020 · 23 comments

Labels: help wanted (Extra attention is needed), question (Further information is requested)
Milestone: 1.0.0

Comments

@rymsha
Contributor

rymsha commented Jun 3, 2020

Umbrella ticket to discuss and implement liveness/readiness probes correctly, because the current approach with app-alive doesn't work well.

Current problems

  • app-alive reports "ready" too early : when Elasticsearch still replicates shards (yellow). It leads do data loses during rolling update.

  • app-alive reports "not alive" on ALL nodes, just because Elasticsearch cluster lost its master due to its restart. Currently it is XP bug, but we still seek for better "alive" check than "es cluster is not red"

@gbbirkisson
Contributor

Let's start by defining those 2 probes:

  • Readiness probe: used to decide when the pod is available for accepting traffic and also if it is safe to update the next pod during a rolling update.
  • Liveness probe: used to decide when to restart a pod due to it being stuck in an undesired state, e.g. a deadlock.

Let's try to list the requirements. @rymsha, you are probably better suited to do this than me, but I can try to start.

For pod to be READY, all of these conditions must be met:

  • (Only clustered deployments) When node is part of Hazelcast cluster
  • (Only clustered deployments) When node is part of Elasticsearch cluster
  • (Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
    • Here I am trying to address the "Rolling Update" issue. This only applies during the initial startup. So later during the pod's lifecycle, this condition becomes irrelevant as far as this pod is concerned.
  • When essential components are running (Jetty etc...)
  • (Only frontend nodes) "Warmup" has been run on the node.
    • This is something @sigdestad has mentioned but is not important for cluster stability.

For pod to be ALIVE, all of these conditions must be met:

  • Node is responsive
    • Now the question becomes, responsive in what way?
  • Pod has at some point had the state READY
    • This is so the pod will not hang in "NOT READY" state forever due to some failures in the initial startup.
  • Pod has been NOT READY for X minutes
    • This could be dangerous if applied on all nodes, but if all nodes are NOT READY for X minutes, then I guess the cluster is FUBAR anyway.

These are the things I can think of off the top of my head. What do you guys think?
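
As a rough illustration of the two clustered-deployment conditions above, a membership check could look something like the sketch below (assuming a HazelcastInstance is available to the probe code; the class name and the expected-member count are illustrative, not existing XP code):

// Sketch only; HazelcastMembershipCheck and expectedMembers are hypothetical.
import com.hazelcast.core.HazelcastInstance;

public class HazelcastMembershipCheck
{
    private final HazelcastInstance hazelcast;

    private final int expectedMembers; // would come from configuration

    public HazelcastMembershipCheck( final HazelcastInstance hazelcast, final int expectedMembers )
    {
        this.hazelcast = hazelcast;
        this.expectedMembers = expectedMembers;
    }

    // READY condition "node is part of Hazelcast cluster": the local instance is
    // running and sees at least the expected number of cluster members.
    public boolean isPartOfCluster()
    {
        return hazelcast.getLifecycleService().isRunning() &&
            hazelcast.getCluster().getMembers().size() >= expectedMembers;
    }
}

A similar check against the Elasticsearch cluster health API could cover the second condition.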

@rymsha
Contributor Author

rymsha commented Jun 4, 2020

Now the question becomes, responsive in what way?

It is truly hard to find good official docs, but not hard to google good explanations for the statement:

Recall that a liveness-probe failure will result in the container being restarted. Unlike a readiness probe, it is not idiomatic to check dependencies in a liveness probe. A liveness probe should be used to check if the container itself has become unresponsive.

It feels like the liveness probe was designed with Node.js in mind (where a deadlock would mean its single thread is stuck).

Anyway, app-alive is one of the worst choices for a liveness probe. Practically, we could skip the liveness probe altogether.
We may end up dedicating a separate Jetty connector (extra TCP port) that will only serve the liveness probe - that connector must not depend on HZ, ES or anything else being available.
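
To make the idea concrete, a minimal sketch of such a dedicated connector could look like this (assuming an embedded Jetty 9 server started independently of the main XP services; the class name and port handling are illustrative only):

// Sketch only: a tiny Jetty server on an extra TCP port that answers the liveness
// probe unconditionally, with no dependency on HZ, ES or anything else.
import java.io.IOException;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class LivenessConnector
{
    public static Server start( final int port )
        throws Exception
    {
        final Server server = new Server();
        final ServerConnector connector = new ServerConnector( server );
        connector.setPort( port ); // the extra TCP port served only by this handler
        server.addConnector( connector );
        server.setHandler( new AbstractHandler()
        {
            @Override
            public void handle( final String target, final Request baseRequest,
                                final HttpServletRequest request, final HttpServletResponse response )
                throws IOException
            {
                response.setStatus( HttpServletResponse.SC_OK );
                response.getWriter().print( "OK" );
                baseRequest.setHandled( true );
            }
        } );
        server.start();
        return server;
    }
}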

@sigdestad
Member

I have not been into this discussion, but we need a check that can be customized, e.g. to visit a list of URLs before a pod is opened for traffic.
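
Such a customizable check could be as small as the sketch below (assuming Java 11's java.net.http client; the class name and the source of the URL list are illustrative, not an existing API):

// Sketch only: the check passes once every configured URL answers with a status below 300.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

public class WarmupCheck
{
    private static final HttpClient CLIENT = HttpClient.newBuilder()
        .connectTimeout( Duration.ofSeconds( 5 ) )
        .build();

    public static boolean warmedUp( final List<String> urls )
    {
        for ( final String url : urls )
        {
            try
            {
                final HttpResponse<Void> response = CLIENT.send(
                    HttpRequest.newBuilder( URI.create( url ) ).GET().build(),
                    HttpResponse.BodyHandlers.discarding() );
                if ( response.statusCode() >= 300 )
                {
                    return false; // URL not warmed up (or not healthy) yet
                }
            }
            catch ( final Exception e )
            {
                return false; // connection failures also count as "not ready"
            }
        }
        return true;
    }
}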

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

(Only frontend nodes) "Warmup" has been run on the node.
This is something @sigdestad has mentioned but is not important for cluster stability.

This may be overkill for "readiness", because

Readiness probes runs on the container during its whole lifecycle

Warm-up could be part of container initialisation, I guess.
N.B. None of the Google Cloud services I have dealt with offer anything fancy to mitigate the so-called "cold start". They just recommend writing applications "differently".

@sigdestad
Member

The thing is that (unless the infrastructure can provide a "slow start") when a cold node is hit with serious traffic, it may "hang" for several minutes until it is able to respond. The result: a lot of unhappy users.

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

Unfortunately there is no happiness probe in k8s yet.

@gbbirkisson ALIVE need not list any READY factors, due to the same note from the official docs:

Readiness probes runs on the container during its whole lifecycle

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

(Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
Here I am trying to address the "Rolling Update" issue. This only applies during the initial startup. So later during the pod's lifecycle, this condition becomes irrelevant as far as this pod is concerned.

There is an interesting idea here: dedicated master nodes are READY even when the ES cluster is YELLOW. There is one potential problem though: we run applications on masters. Worse, the elected master does most of the background tasks.

Another problem is our inability to distinguish between the initial startup and ongoing readiness checks.
That capability is available as a feature from k8s 1.16: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-startup-probe
And some important (IMO) features are not even implemented yet: kubernetes/enhancements#950

@gbbirkisson
Contributor

I think we might be able to use the startup probe. 1.16 is available on the GKE regular channel and in Azure, so let's assume that is possible. That should help us a lot. Then we have:

  • startupProbe: Indicates whether the application within the Container is started. All other probes are disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the kubelet kills the Container
  • readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod (Also remember the rolling update issue)
  • livenessProbe: Indicates whether the Container is running. If the liveness probe fails, the kubelet kills the Container

So my attempt to revise the previous list would be something like:

For startup probe to succeed, all of these conditions must be met:

  • (Only clustered deployments) When node is part of Hazelcast cluster
  • (Only clustered deployments) When node is part of Elasticsearch cluster
  • Repositories have been initialized
  • When essential components are running (Jetty etc...)

For readiness probe to succeed, all of these conditions must be met:

  • (Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
    • Here I am trying to address the "Rolling Update" issue. This only applies during the initial startup. So later during the pod's lifecycle, this condition becomes irrelevant as far as this pod is concerned.
  • (Only frontend nodes) "Warmup" has been run on the node.
    • This is something @sigdestad has mentioned but is not important for cluster stability.

For liveness probe to succeed, all of these conditions must be met:

  • Node is responsive
    • Now the question becomes, responsive in what way?

@gbbirkisson
Contributor

Just one question, @rymsha. We want the new alive app (let's call it app-k8s-probe) to be started as soon as possible. Would it be sufficient to put it in the deploy folder, or do we have to put it in the system folders? It probably depends on ES, Hazelcast and Jetty at least.

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

Anything put into the system/*/** folder will be deployed and started by the launcher (the ProvisionActivator class does this). Such an app will have to control its dependencies carefully.

Anything put into the deploy folder is deployed and started when the main services are started (the DeployDirectoryWatcher class does this). Such an app will only get started when all of the "standard" services are up.

So, just putting an app in this or that folder won't help much. It depends on what we want from that app and how we write it.

But I'm still trying to figure out this puzzle in my head. I believe app-k8s-probe should be available as late as possible, but not later. It sounds like the opposite of what you wrote, but in practice it repeats what you meant.

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

I'm not familiar with the startup probe yet. I need to try it first. Hopefully it holds the rolling update the same way a readiness probe does. If so, then:

For startup probe to succeed, all of these conditions must be met:

  • (Only clustered deployments) When node is part of Hazelcast cluster
  • (Only clustered deployments) When node is part of Elasticsearch cluster
  • Repositories have been initialized (implies local Elasticsearch readiness)
  • When essential components are running (Jetty etc...)
  • (Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
  • (Only frontend nodes) "Warmup" has been run on the node. (??? happiness probe)

For readiness probe to succeed, all of these conditions must be met:

For liveness probe to succeed, all of these conditions must be met:

??? I still can't find a good reason to automatically kill a pod. Anything that depends on Jetty depends on Hazelcast as well (for distributed sessions), so my idea about a dedicated Jetty connector may not be so good either. We should skip the liveness probe for now and leave it for non-legacy, fancy, but deadlock-prone frameworks (like Node.js).

@gbbirkisson
Contributor

Ok we can drop the liveness probe for now. I think the checks ...

  • (Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
  • (Only frontend nodes) "Warmup" has been run on the node. (??? happiness probe)

... should be in the readiness probe. Simply because the startup probe will kill the pods, and the readiness probe will not. So then we do not risk the startup probe killing a node that is replicating indexes.

@gbbirkisson
Contributor

I was thinking we could expose app-k8s-probe on port 2609, and that it would serve as an endpoint for all the probes. This is a small example of how that could look:

@Component
public class K8sProbeReporter
    implements StatusReporter
{
    @Override
    public String getName()
    {
        return "k8s.probes";
    }

    @Override
    public MediaType getMediaType()
    {
        return MediaType.PLAIN_TEXT_UTF_8;
    }

    @Override
    public void report( final StatusContext context )
        throws IOException
    {
        Optional<String> type = context.getParameter( "type" );

        switch ( type.orElseThrow( () -> new ProbeException( "Missing 'type' parameter" ) ) )
        {
            case "startup":
                doStartupCheck();
                break;
            case "readiness":
                doReadinessCheck();
                break;
            case "liveness":
                doLivenessCheck();
                break;
            default:
                throw new ProbeException( "Invalid 'type' parameter" );
        }

        context.getOutputStream().write( "OK".getBytes( StandardCharsets.UTF_8 ) );
    }

    private void doStartupCheck()
        throws ProbeException
    {
        // We know jetty is running because otherwise we would not be serving this request
        // Also because jetty is running, we know that hazelcast is OK

        checkIfReposAreInitialized(); // This should tell us elastic search is ready

        checkIfEssentialComponentsAreRunning();

        // No checks threw exception, exit ...
    }

    private void doReadinessCheck()
        throws ProbeException
    {
        if ( isClustered() && isDataNode() && thisNodeHasNeverSeenGreenState() )
        {
            checkIfClusterInGreenState();
        }

        checkIfLocalElasticSearchAcceptsRequests();

        if ( !isClustered() || isFrontendNode() )
        {
            checkIfHasDoneWarmup();
            checkIf8080Responds();
        }

        // No checks threw exception, exit ...
    }

    private void doLivenessCheck()
    {
        // Jetty is responding, so no need to do other checks
    }

    @Override
    public void report( final OutputStream stream )
        throws IOException
    {
        throw new ProbeException( "This method should not be called!" );
    }

    public static class ProbeException
        extends IOException
    {
        public ProbeException( String message )
        {
            super( message );
        }
    }
}
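
With a reporter like this, all three kubelet probes could point at the same status endpoint on port 2609 and differ only in the type parameter (startup, readiness or liveness). The exact URL path would depend on how XP exposes its status reporters, so it is left out here.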

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

  • We want to wait for a GREEN ES cluster state when a pod has just started, or data loss is almost guaranteed.
  • We don't want a YELLOW ES cluster state to prevent front-end nodes from receiving traffic, or unnecessary downtime will always be with us.

I see how you attempt to solve this with checkIfClusterInGreenState(); I just don't know how to implement it (yet).
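
One possible shape for it, sketched against the classic Elasticsearch Java client API (assuming the probe app can get hold of an org.elasticsearch.client.Client instance; the exact import packages depend on the ES version XP bundles, so they are omitted):

// Sketch only: fail the check unless the cluster health is GREEN.
private void checkIfClusterInGreenState( final Client client )
    throws ProbeException
{
    final ClusterHealthResponse health = client.admin().cluster()
        .prepareHealth()
        .setTimeout( TimeValue.timeValueSeconds( 1 ) )
        .get();

    if ( health.getStatus() != ClusterHealthStatus.GREEN )
    {
        throw new ProbeException( "Elasticsearch cluster is " + health.getStatus() );
    }
}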

@rymsha
Contributor Author

rymsha commented Jun 6, 2020

I think the checks ...
(Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
(Only frontend nodes) "Warmup" has been run on the node. (??? happiness probe)
... should be in the readiness probe.

As you noticed, I think the opposite.
We could instead give a much greater timeout to the startup probe (1 hour?). The startup probe should succeed only if the cluster is GREEN (to prevent data loss on rolling update).
Usually the time to get a GREEN cluster is short, because other nodes have shard replicas, so periodSeconds should be low.
But sometimes the cluster gets shrunk and ES will look for other places for replicas - the sync can take a long time and the cluster will be YELLOW during all that period. Fortunately, startup probes don't run at such times.
In a way it is also better than the infinite (and not configurable) timeout of the readiness probe.

Warmup (Happiness probe):
XP currently relies a lot on on-heap caches to meet performance demands. These caches are loaded lazily (on demand). That makes responses after cold starts slow.

Nowadays software is made to give "good-enough" performance from the start, instead of "maximum possible" performance that only arrives after a few iterations.
Examples:

  • Elasticsearch deprecated Warmers.
  • AOT compilers are more and more preferred (GoLang, GraalVM Native Image, Kotlin Native)

So, my input on this: fix slow starts as much as possible; don't waste time figuring out how to write warmups and introducing even more entropy into our already quite complex cluster startup process.

@sigdestad
Member

If we could drop lazy loading of bundles in the server distro, but keep it for the SDK, initial performance might be better. Also, warmup could be really easy to achieve: just invoke the root entry for each configured vhost before starting to serve requests.

@gbbirkisson
Contributor

We could instead give a much greater timeout for startup probe (1 hour?)

Let's imagine we have 3 new nodes and 1 fails during startup (early on, we have seen that can happen). The cluster will then be in a YELLOW state for an hour. No node will be "READY", since they are all technically still in startup, waiting for that one node to join the cluster. They will then all be restarted at the same time when all their startup probes fail simultaneously. This is not good.

Startup probe should succeed only if cluster is GREEN (to prevent data loss on rolling update)

If replication takes more time than the startup probe timeout, it could result in data loss. As you said, it can take a long time, and the solution is not to just increase the timeout to huge amounts. Like I said earlier: "... the startup probe will kill the pods, and the readiness probe will not." So I think it is safer to check the replication in the readiness probe. Remember, the next node during a rolling update will not bounce until the readiness probe says it's OK.

But sometimes the cluster gets shrunk and ES will look for other places for replicas - the sync can take a long time and the cluster will be YELLOW during all that period. Fortunately, startup probes don't run at such times.

This is highly dependent on the way we implement the readiness probe. In the code snippet I put in this thread, the readiness probe does not care if the cluster goes to a yellow state. So in that case, shrinking the cluster will not affect the LB.

@rymsha
Contributor Author

rymsha commented Jun 8, 2020

Let's imagine we have 3 new nodes and 1 fails during startup (early on, we have seen that can happen). The cluster will then be in a YELLOW state for an hour. No node will be "READY", since they are all technically still in startup, waiting for that one node to join the cluster. They will then all be restarted at the same time when all their startup probes fail simultaneously. This is not good.

1 out of 3 failed is, surprisingly, GREEN - because a master will be elected and all shards will find their replicas.

@rymsha
Contributor Author

rymsha commented Jun 8, 2020

If replication takes more time than the startup probe timeout, it could result in data loss. As you said, it can take a long time, and the solution is not to just increase the timeout to huge amounts. Like I said earlier: "... the startup probe will kill the pods, and the readiness probe will not." So I think it is safer to check the replication in the readiness probe. Remember, the next node during a rolling update will not bounce until the readiness probe says it's OK.

Killing a pod which is not yet considered started is not an issue, because there is certainly a full replica of its data elsewhere (it needs to get the data from somewhere). No data loss is possible.

@sigdestad
Member

It would be nice to have a meeting about this - go through all the steps and then agree on a solution?

@gbbirkisson
Contributor

The conclusion of the meeting was:

  • Master nodes must be in a single node group
  • Data nodes must be in a single node group
  • Only use readiness probe for now
    • For data nodes: Cluster must be in a green state
    • For non data nodes: Cluster must be in a yellow state
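
Expressed as code, that readiness rule could look roughly like the sketch below (isDataNode() and clusterStatus() are hypothetical helpers standing in for however the probe app actually obtains that information):

// Sketch only: data nodes require a GREEN cluster; other nodes only require the
// cluster not to be RED (i.e. YELLOW or better).
private void doReadinessCheck()
    throws ProbeException
{
    final ClusterHealthStatus status = clusterStatus();

    if ( isDataNode() )
    {
        if ( status != ClusterHealthStatus.GREEN )
        {
            throw new ProbeException( "Data node requires GREEN cluster, got " + status );
        }
    }
    else if ( status == ClusterHealthStatus.RED )
    {
        throw new ProbeException( "Cluster is RED" );
    }
}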

@gbbirkisson
Contributor

Now if you have:

  • 3 master nodes
  • 2 data/frontend nodes

During a rolling update, all the data nodes will become NOT READY because the cluster is yellow. This is desired behavior.

But this will yank all of the nodes from the lb, and that is not good. How to deal with this?

This will only happen with 1-2 data nodes but not with 3+.

Obviously this would not be a problem if you have dedicated frontend nodes.

@gbbirkisson gbbirkisson self-assigned this Mar 8, 2021
@gbbirkisson gbbirkisson added the question Further information is requested label May 11, 2022
@gbbirkisson gbbirkisson removed their assignment May 11, 2022
@vbradnitski vbradnitski added this to the 1.0.0 milestone Feb 7, 2023
@vbradnitski vbradnitski added the help wanted Extra attention is needed label Feb 7, 2023
@vbradnitski
Contributor

HealthCheck endpoint: enonic/xp#9199
