
Master reboot can cause etcd/etcd-events pods to not start up because of races #5578

Closed
@kanantheswaran-splunk

Description

1. What kops version are you running? The command kops version will display
this information.

Version 1.9.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.9.9 (we use this version because we really need this fix, without which we encounter data plane outages when rolling masters)

3. What cloud provider are you using?

AWS (CoreOS AMI)

4. What commands did you run? What is the simplest way to reproduce this issue?

Masters were rebooted at different times after a successful initial setup. Since this appears to be an initialization race (see below), it cannot be reproduced at will.

5. What happened after the commands executed?

One master was trying to run both etcd pods (main and events), another was running just the main etcd pod and wasn't even trying to set up the events pod, and the third wasn't trying to run either.

When the systems were inspected, manifest files were present for both the etcd and etcd-events pods.

6. What did you expect to happen?

The etcd main and events pods should have started correctly on all masters.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

This is not relevant to the problem at hand.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

This is not relevant to the problem at hand.

9. Anything else we need to know?

We believe there is a race between the actions that mount the etcd volumes on the local filesystem and the kubelet reading and running the etcd manifest files.

The error messages we saw were as follows:

Aug 03 05:39:57 XXX.compute.internal kubelet[1173]: E0803 05:39:57.378908    1173 file.go:139] Can't get metadata for "/etc/kubernetes/manifests/etcd-events.manifest": stat /etc/kubernetes/manifests/etcd-events.manifest: no such file or directory
Aug 03 05:39:57 XXX.compute.internal kubelet[1173]: E0803 05:39:57.379946    1173 file.go:139] Can't get metadata for "/etc/kubernetes/manifests/etcd.manifest": stat /etc/kubernetes/manifests/etcd.manifest: no such file or directory
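
For context, per the description below, the files under /etc/kubernetes/manifests are symlinks whose targets live on the etcd volumes, so the stat fails whenever the backing volume is not yet mounted. A hypothetical diagnostic sketch for confirming this state on an affected master (the grep pattern assumes the etcd volumes are mounted somewhere under /mnt; adjust for your setup):

# the symlinks exist even while their targets are missing
ls -l /etc/kubernetes/manifests/
# stat follows the symlink, so it fails until the etcd volume is mounted
stat /etc/kubernetes/manifests/etcd.manifest
# check which etcd volumes are currently mounted
mount | grep /mnt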

When masters are initially set up for a new cluster, protokube has already mounted the etcd volumes, written the manifests, and symlinked them under /etc/kubernetes/manifests before the kubelet service is started.

On reboot, protokube remounts the volumes, but since the kubelet service is already installed and enabled, kubelet can start before the volumes have been mounted. It then fails to resolve the symlinks and does not start one or both of the etcd pods, depending on which volumes happen to be mounted at that instant.
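
One way to serialize things by hand would be a systemd drop-in that holds kubelet back until the manifest symlinks resolve. This is only a sketch under the assumptions that kubelet runs as a stock systemd unit and that both manifests are expected on every master; an indefinite wait would eventually hit the unit's start timeout, so treat it as illustrative rather than production-ready:

# Hypothetical drop-in: hold kubelet until protokube has mounted the
# etcd volumes and the manifest symlinks resolve.
mkdir -p /etc/systemd/system/kubelet.service.d
cat >/etc/systemd/system/kubelet.service.d/10-wait-etcd-manifests.conf <<'EOF'
[Service]
ExecStartPre=/bin/sh -c 'until stat /etc/kubernetes/manifests/etcd.manifest /etc/kubernetes/manifests/etcd-events.manifest; do sleep 2; done'
EOF
systemctl daemon-reload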

It also seems that there is no filesystem event after that point that would cause the kubelet to rediscover the manifests and start the missing etcd pods.

A simple solution might be to configure the kubelet service not to autostart on boot, since protokube will start it anyway.
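
A minimal sketch of that workaround, assuming kubelet is a standard systemd unit named kubelet.service and that protokube does the equivalent of systemctl start once the volumes are mounted and the manifests written:

# Hypothetical workaround: drop kubelet from the boot sequence and let
# protokube start it once the etcd volumes are in place.
systemctl disable kubelet.service
# confirm it will no longer autostart
systemctl is-enabled kubelet.service   # expected output: disabled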
