
Liveness/Readiness probes #65

Closed

rymsha opened this issue Jun 3, 2020 · 23 comments

Labels: help wanted (Extra attention is needed), question (Further information is requested)
Milestone: 1.0.0

Comments

@rymsha
Contributor

rymsha commented Jun 3, 2020

Umbrella ticket to discuss and implement liveness/readiness probes correctly, because the current approach with app-alive doesn't work well.

Current problems

  • app-alive reports "ready" too early : when Elasticsearch still replicates shards (yellow). It leads do data loses during rolling update.

  • app-alive reports "not alive" on ALL nodes, just because Elasticsearch cluster lost its master due to its restart. Currently it is XP bug, but we still seek for better "alive" check than "es cluster is not red"

@gbbirkisson
Contributor

Let's start by defining those 2 probes:

  • Readiness probe: used to decide when the pod is available for accepting traffic and also if it is safe to update the next pod during a rolling update.
  • Liveness probe: used to decide when to restart a pod due to it being stuck in an undesired state, e.g. a deadlock.

Let's try to list the requirements. @rymsha, you are probably better suited to do this than me, but I can try to start.

For pod to be READY, all of these conditions must be met:

  • (Only clustered deployments) When node is part of Hazelcast cluster
  • (Only clustered deployments) When node is part of Elasticsearch cluster
  • (Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
    • Here I am trying to address the "Rolling Update" issue. This only applies during the initial startup. So later during the pod's lifecycle, this condition becomes irrelevant as far as this pod is concerned.
  • When essential components are running (Jetty etc...)
  • (Only frontend nodes) "Warmup" has been run on the node.
    • This is something @sigdestad has mentioned but is not important for cluster stability.

For pod to be ALIVE, all of these conditions must be met:

  • Node is responsive
    • Now the question becomes, responsive in what way?
  • Pod has at some point had the state READY
    • This is so the pod will not hang in "NOT READY" state forever due to some failures in the initial startup.
  • Pod has been NOT READY for X minutes
    • This could be dangerous if applied on all nodes, but if all nodes are NOT READY for X minutes, then I guess the cluster is FUBAR anyway.

These are the things I can think of off the top of my head. What do you guys think?
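
As a rough illustration of the two clustered-deployment conditions above, a membership check could look something like the sketch below (assuming a HazelcastInstance is available to the probe code; the class name and the expected-member count are illustrative, not existing XP code):

// Sketch only; HazelcastMembershipCheck and expectedMembers are hypothetical.
import com.hazelcast.core.HazelcastInstance;

public class HazelcastMembershipCheck
{
    private final HazelcastInstance hazelcast;

    private final int expectedMembers; // would come from configuration

    public HazelcastMembershipCheck( final HazelcastInstance hazelcast, final int expectedMembers )
    {
        this.hazelcast = hazelcast;
        this.expectedMembers = expectedMembers;
    }

    // READY condition "node is part of Hazelcast cluster": the local instance is
    // running and sees at least the expected number of cluster members.
    public boolean isPartOfCluster()
    {
        return hazelcast.getLifecycleService().isRunning() &&
            hazelcast.getCluster().getMembers().size() >= expectedMembers;
    }
}

A similar check against the Elasticsearch cluster health API could cover the second condition.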

@rymsha
Contributor Author

rymsha commented Jun 4, 2020

Now the question becomes, responsive in what way?

It is truly hard to find good official docs, but not hard to google good explanations for the statement:

Recall that a liveness-probe failure will result in the container being restarted. Unlike a readiness probe, it is not idiomatic to check dependencies in a liveness probe. A liveness probe should be used to check if the container itself has become unresponsive.

It feels like the liveness probe was designed with Node.js in mind (where a deadlock would mean its single thread is stuck).

Anyway, app-alive is one of the worst choices for a liveness probe. Practically, we could skip the liveness probe altogether.
We may end up dedicating a separate Jetty connector (extra TCP port) that will only serve the liveness probe - that connector must not depend on HZ, ES or anything else being available.
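
To make the idea concrete, a minimal sketch of such a dedicated connector could look like this (assuming an embedded Jetty 9 server started independently of the main XP services; the class name and port handling are illustrative only):

// Sketch only: a tiny Jetty server on an extra TCP port that answers the liveness
// probe unconditionally, with no dependency on HZ, ES or anything else.
import java.io.IOException;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class LivenessConnector
{
    public static Server start( final int port )
        throws Exception
    {
        final Server server = new Server();
        final ServerConnector connector = new ServerConnector( server );
        connector.setPort( port ); // the extra TCP port served only by this handler
        server.addConnector( connector );
        server.setHandler( new AbstractHandler()
        {
            @Override
            public void handle( final String target, final Request baseRequest,
                                final HttpServletRequest request, final HttpServletResponse response )
                throws IOException
            {
                response.setStatus( HttpServletResponse.SC_OK );
                response.getWriter().print( "OK" );
                baseRequest.setHandled( true );
            }
        } );
        server.start();
        return server;
    }
}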

@sigdestad
Member

I have not been into this discussion, but we need a check that can be customized, e.g. to visit a list of URLs before a pod is opened for traffic.
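
Such a customizable check could be as small as the sketch below (assuming Java 11's java.net.http client; the class name and the source of the URL list are illustrative, not an existing API):

// Sketch only: the check passes once every configured URL answers with a status below 300.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

public class WarmupCheck
{
    private static final HttpClient CLIENT = HttpClient.newBuilder()
        .connectTimeout( Duration.ofSeconds( 5 ) )
        .build();

    public static boolean warmedUp( final List<String> urls )
    {
        for ( final String url : urls )
        {
            try
            {
                final HttpResponse<Void> response = CLIENT.send(
                    HttpRequest.newBuilder( URI.create( url ) ).GET().build(),
                    HttpResponse.BodyHandlers.discarding() );
                if ( response.statusCode() >= 300 )
                {
                    return false; // URL not warmed up (or not healthy) yet
                }
            }
            catch ( final Exception e )
            {
                return false; // connection failures also count as "not ready"
            }
        }
        return true;
    }
}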

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

(Only frontend nodes) "Warmup" has been run on the node.
This is something @sigdestad has mentioned but is not important for cluster stability.

This may be overkill for "readiness", because

Readiness probes runs on the container during its whole lifecycle

Warm-up could be part of container initialisation, I guess.
N.B. None of the Google Cloud services I have dealt with offer anything fancy to mitigate the so-called "cold start". They just recommend writing applications "differently".

@sigdestad
Member

The thing is that (unless the infrastructure can provide a "slow start") when a cold node is hit with serious traffic, it may "hang" for several minutes until it is able to respond. The result: a lot of unhappy users.

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

Unfortunately there is no happiness probe in k8s yet.

@gbbirkisson ALIVE need not list any READY factors, due to the same note from the official docs:

Readiness probes runs on the container during its whole lifecycle

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

(Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
Here I am trying to address the "Rolling Update" issue. This only applies during the initial startup. So later during the pod's lifecycle, this condition becomes irrelevant as far as this pod is concerned.

There is an interesting idea here: dedicated master nodes are READY even when the ES cluster is YELLOW. There is one potential problem though: we run applications on masters. Worse, the elected master does most of the background tasks.

Another problem is our inability to distinguish between the initial startup and ongoing readiness checks.
That capability is available as a feature from k8s 1.16: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-startup-probe
And some important (IMO) features are not even implemented yet: kubernetes/enhancements#950

@gbbirkisson
Contributor

I think we might be able to use the startup probe. 1.16 is available on the GKE regular channel and in Azure, so let's assume that is possible. That should help us a lot. Then we have:

  • startupProbe: Indicates whether the application within the Container is started. All other probes are disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the kubelet kills the Container
  • readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod (Also remember the rolling update issue)
  • livenessProbe: Indicates whether the Container is running. If the liveness probe fails, the kubelet kills the Container

So my attempt to revise the previous list would be something like:

For startup probe to succeed, all of these conditions must be met:

  • (Only clustered deployments) When node is part of Hazelcast cluster
  • (Only clustered deployments) When node is part of Elasticsearch cluster
  • Repositories have been initialized
  • When essential components are running (Jetty etc...)

For readiness probe to succeed, all of these conditions must be met:

  • (Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
    • Here I am trying to address the "Rolling Update" issue. This only applies during the initial startup. So later during the pod's lifecycle, this condition becomes irrelevant as far as this pod is concerned.
  • (Only frontend nodes) "Warmup" has been run on the node.
    • This is something @sigdestad has mentioned but is not important for cluster stability.

For liveness probe to succeed, all of these conditions must be met:

  • Node is responsive
    • Now the question becomes, responsive in what way?

@gbbirkisson
Contributor

Just one question, @rymsha. We want the new alive app (let's call it app-k8s-probe) to be started as soon as possible. Would it be sufficient to put it in the deploy folder, or do we have to put it in the system folders? It probably depends on ES, Hazelcast and Jetty at least.

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

Anything put into the system/*/** folder will be deployed and started by the launcher (the ProvisionActivator class does this). Such an app will have to control its dependencies carefully.

Anything put into the deploy folder is deployed and started when the main services are started (the DeployDirectoryWatcher class does this). Such an app will only get started when all of the "standard" services are up.

So, just putting an app in this or that folder won't help much. It depends on what we want from that app and how we write it.

But I'm still trying to figure out this puzzle in my head. I believe app-k8s-probe should be available as late as possible, but not later. It sounds like the opposite of what you wrote, but in practice it repeats what you meant.

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

I'm not familiar with the startup probe yet. I need to try it first. Hopefully it holds the rolling update the same way a readiness probe does. If so, then:

For startup probe to succeed, all of these conditions must be met:

  • (Only clustered deployments) When node is part of Hazelcast cluster
  • (Only clustered deployments) When node is part of Elasticsearch cluster
  • Repositories have been initialized (implies local Elasticsearch readiness)
  • When essential components are running (Jetty etc...)
  • (Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
  • (Only frontend nodes) "Warmup" has been run on the node. (??? happiness probe)

For readiness probe to succeed, all of these conditions must be met:

For liveness probe to succeed, all of these conditions must be met:

??? I still can't find a good reason to automatically kill a pod. Anything that depends on Jetty depends on Hazelcast as well (for distributed sessions), so my idea about a dedicated Jetty connector may not be so good either. We should skip the liveness probe for now and leave it for non-legacy, fancy, but deadlock-prone frameworks (like Node.js).

@gbbirkisson
Contributor

Ok we can drop the liveness probe for now. I think the checks ...

  • (Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
  • (Only frontend nodes) "Warmup" has been run on the node. (??? happiness probe)

... should be in the readiness probe. Simply because the startup probe will kill the pods, and the readiness probe will not. So then we do not risk the startup probe killing a node that is replicating indexes.

@gbbirkisson
Contributor

I was thinking we could expose app-k8s-probe on port 2609, and that it would serve as an endpoint for all the probes. This is a small example of how that could look:

@Component
public class K8sProbeReporter
    implements StatusReporter
{
    @Override
    public String getName()
    {
        return "k8s.probes";
    }

    @Override
    public MediaType getMediaType()
    {
        return MediaType.PLAIN_TEXT_UTF_8;
    }

    @Override
    public void report( final StatusContext context )
        throws IOException
    {
        Optional<String> type = context.getParameter( "type" );

        switch ( type.orElseThrow( () -> new ProbeException( "Missing 'type' parameter" ) ) )
        {
            case "startup":
                doStartupCheck();
                break;
            case "readiness":
                doReadinessCheck();
                break;
            case "liveness":
                doLivenessCheck();
                break;
            default:
                throw new ProbeException( "Invalid 'type' parameter" );
        }

        context.getOutputStream().write( "OK".getBytes( StandardCharsets.UTF_8 ) );
    }

    private void doStartupCheck()
        throws ProbeException
    {
        // We know jetty is running because otherwise we would not be serving this request
        // Also because jetty is running, we know that hazelcast is OK

        checkIfReposAreInitialized(); // This should tell us elastic search is ready

        checkIfEssentialComponentsAreRunning();

        // No checks threw exception, exit ...
    }

    private void doReadinessCheck()
        throws ProbeException
    {
        if ( isClustered() && isDataNode() && thisNodeHasNeverSeenGreenState() )
        {
            checkIfClusterInGreenState();
        }

        checkIfLocalElasticSearchAcceptsRequests();

        if ( !isClustered() || isFrontendNode() )
        {
            checkIfHasDoneWarmup();
            checkIf8080Responds();
        }

        // No checks threw exception, exit ...
    }

    private void doLivenessCheck()
    {
        // Jetty is responding, so no need to do other checks
    }

    @Override
    public void report( final OutputStream stream )
        throws IOException
    {
        throw new ProbeException( "This method should not be called!" );
    }

    public static class ProbeException
        extends IOException
    {
        public ProbeException( String message )
        {
            super( message );
        }
    }
}
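
With a reporter like this, all three kubelet probes could point at the same status endpoint on port 2609 and differ only in the type parameter (startup, readiness or liveness). The exact URL path would depend on how XP exposes its status reporters, so it is left out here.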

@rymsha
Contributor Author

rymsha commented Jun 5, 2020

  • We want to wait for a GREEN ES cluster state when a pod has just started, or data loss is almost guaranteed.
  • We don't want a YELLOW ES cluster state to prevent front-end nodes from receiving traffic, or unnecessary downtime will always be with us.

I see how you attempt to solve this with checkIfClusterInGreenState(); I just don't know how to implement it (yet).
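
One possible shape for it, sketched against the classic Elasticsearch Java client API (assuming the probe app can get hold of an org.elasticsearch.client.Client instance; the exact import packages depend on the ES version XP bundles, so they are omitted):

// Sketch only: fail the check unless the cluster health is GREEN.
private void checkIfClusterInGreenState( final Client client )
    throws ProbeException
{
    final ClusterHealthResponse health = client.admin().cluster()
        .prepareHealth()
        .setTimeout( TimeValue.timeValueSeconds( 1 ) )
        .get();

    if ( health.getStatus() != ClusterHealthStatus.GREEN )
    {
        throw new ProbeException( "Elasticsearch cluster is " + health.getStatus() );
    }
}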

@rymsha
Contributor Author

rymsha commented Jun 6, 2020

I think the checks ...
(Only clustered deployments, Only data nodes) Elasticsearch cluster has replicated all shards.
(Only frontend nodes) "Warmup" has been run on the node. (??? happiness probe)
... should be in the readiness probe.

As you noticed, I think the opposite.
We could instead give a much greater timeout to the startup probe (1 hour?). The startup probe should succeed only if the cluster is GREEN (to prevent data loss on rolling update).
Usually the time to get a GREEN cluster is short, because other nodes have shard replicas, so periodSeconds should be low.
But sometimes the cluster gets shrunk and ES will look for other places for replicas - the sync can take a long time and the cluster will be YELLOW during all that period. Fortunately, startup probes don't run at such times.
In a way it is also better than the infinite (and not configurable) timeout of the readiness probe.

Warmup (Happiness probe):
XP currently relies a lot on on-heap caches to meet performance demands. These caches are loaded lazily (on demand). That makes responses after cold starts slow.

Nowadays software is made to give "good-enough" performance from the start, instead of "maximum possible" performance that only arrives after a few iterations.
Examples:

  • Elasticsearch deprecated Warmers.
  • AOT compilers are more and more preferred (GoLang, GraalVM Native Image, Kotlin Native)

So, my input on this: fix slow starts as much as possible; don't waste time figuring out how to write warmups and introducing even more entropy into our already quite complex cluster startup process.

@sigdestad
Member

If we could drop lazy loading of bundles in the server distro, but keep it for the SDK, initial performance might be better. Also, warmup could be really easy to achieve: just invoke the root entry for each configured vhost before starting to serve requests.

@gbbirkisson
Contributor

We could instead give a much greater timeout for startup probe (1 hour?)

Let's imagine we have 3 new nodes and 1 fails during startup (early on, we have seen that can happen). The cluster will then be in a YELLOW state for an hour. No node will be "READY", since they are all technically still in startup, waiting for that one node to join the cluster. They will then all be restarted at the same time when all their startup probes fail simultaneously. This is not good.

Startup probe should succeed only if cluster is GREEN (to prevent data loss on rolling update)

If replication takes more time than the startup probe timeout, it could result in data loss. As you said, it can take a long time, and the solution is not to just increase the timeout to huge amounts. Like I said earlier: "... the startup probe will kill the pods, and the readiness probe will not." So I think it is safer to check the replication in the readiness probe. Remember, the next node during a rolling update will not bounce until the readiness probe says it's OK.

But sometimes the cluster gets shrunk and ES will look for other places for replicas - the sync can take a long time and the cluster will be YELLOW during all that period. Fortunately, startup probes don't run at such times.

This is highly dependent on the way we implement the readiness probe. In the code snippet I put in this thread, the readiness probe does not care if the cluster goes to a yellow state. So in that case, shrinking the cluster will not affect the LB.

@rymsha
Contributor Author

rymsha commented Jun 8, 2020

Let's imagine we have 3 new nodes and 1 fails during startup (early on, we have seen that can happen). The cluster will then be in a YELLOW state for an hour. No node will be "READY", since they are all technically still in startup, waiting for that one node to join the cluster. They will then all be restarted at the same time when all their startup probes fail simultaneously. This is not good.

1 out of 3 failed is, surprisingly, GREEN - because a master will be elected and all shards will find their replicas.

@rymsha
Contributor Author

rymsha commented Jun 8, 2020

If replication takes more time than the startup probe timeout, it could result in data loss. As you said, it can take a long time, and the solution is not to just increase the timeout to huge amounts. Like I said earlier: "... the startup probe will kill the pods, and the readiness probe will not." So I think it is safer to check the replication in the readiness probe. Remember, the next node during a rolling update will not bounce until the readiness probe says it's OK.

Killing a pod which is not yet considered started is not an issue, because there is certainly a full replica of its data elsewhere (it needs to get the data from somewhere). No data loss is possible.

@sigdestad
Member

It would be nice to have a meeting about this - go through all the steps and then agree on a solution?

@gbbirkisson
Contributor

The conclusion of the meeting was:

  • Master nodes must be in a single node group
  • Data nodes must be in a single node group
  • Only use readiness probe for now
    • For data nodes: Cluster must be in a green state
    • For non data nodes: Cluster must be in a yellow state
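
Expressed as code, that readiness rule could look roughly like the sketch below (isDataNode() and clusterStatus() are hypothetical helpers standing in for however the probe app actually obtains that information):

// Sketch only: data nodes require a GREEN cluster; other nodes only require the
// cluster not to be RED (i.e. YELLOW or better).
private void doReadinessCheck()
    throws ProbeException
{
    final ClusterHealthStatus status = clusterStatus();

    if ( isDataNode() )
    {
        if ( status != ClusterHealthStatus.GREEN )
        {
            throw new ProbeException( "Data node requires GREEN cluster, got " + status );
        }
    }
    else if ( status == ClusterHealthStatus.RED )
    {
        throw new ProbeException( "Cluster is RED" );
    }
}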

@gbbirkisson
Contributor

Now if you have:

  • 3 master nodes
  • 2 data/frontend nodes

During a rolling update, all the data nodes will become NOT READY because the cluster is yellow. This is desired behavior.

But this will yank all of the nodes from the lb, and that is not good. How to deal with this?

This will only happen with 1-2 data nodes but not with 3+.

Obviously this would not be a problem if you have dedicated frontend nodes.

@gbbirkisson gbbirkisson self-assigned this Mar 8, 2021
@gbbirkisson gbbirkisson added the question Further information is requested label May 11, 2022
@gbbirkisson gbbirkisson removed their assignment May 11, 2022
@vbradnitski vbradnitski added this to the 1.0.0 milestone Feb 7, 2023
@vbradnitski vbradnitski added the help wanted Extra attention is needed label Feb 7, 2023
@vbradnitski
Contributor

HealthCheck endpoint: enonic/xp#9199
