
Master reboot can cause etcd/etcd-events pods to not start up because of races #5578

Closed
@kanantheswaran-splunk

Description

1. What kops version are you running? The command kops version will display
this information.

Version 1.9.0

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.9.9 (we use this version because we really need this fix, without which we encounter data plane outages when rolling masters)

3. What cloud provider are you using?

AWS (CoreOS AMI)

4. What commands did you run? What is the simplest way to reproduce this issue?

Masters were rebooted at different times after a successful initial setup. Since this appears to be an initialization race (see below), it cannot be reproduced at will.

5. What happened after the commands executed?

One master was trying to run both etcd pods (main and events), another was running just the main etcd pod and wasn't even trying to set up the events pod, and the third wasn't trying to run either.

When the systems were inspected, manifest files were present for both the etcd and etcd-events pods.

6. What did you expect to happen?

The etcd main and events pods should have started correctly on all masters.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

This is not relevant to the problem at hand.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

This is not relevant to the problem at hand.

9. Anything else we need to know?

We believe there is a race between the actions that mount the etcd volumes on the local filesystem and the kubelet reading and running the etcd manifest files.

The error messages we saw were as follows:

Aug 03 05:39:57 XXX.compute.internal kubelet[1173]: E0803 05:39:57.378908    1173 file.go:139] Can't get metadata for "/etc/kubernetes/manifests/etcd-events.manifest": stat /etc/kubernetes/manifests/etcd-events.manifest: no such file or directory
Aug 03 05:39:57 XXX.compute.internal kubelet[1173]: E0803 05:39:57.379946    1173 file.go:139] Can't get metadata for "/etc/kubernetes/manifests/etcd.manifest": stat /etc/kubernetes/manifests/etcd.manifest: no such file or directory
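
For context, per the description below, the files under /etc/kubernetes/manifests are symlinks whose targets live on the etcd volumes, so the stat fails whenever the backing volume is not yet mounted. A hypothetical diagnostic sketch for confirming this state on an affected master (the grep pattern assumes the etcd volumes are mounted somewhere under /mnt; adjust for your setup):

# the symlinks exist even while their targets are missing
ls -l /etc/kubernetes/manifests/
# stat follows the symlink, so it fails until the etcd volume is mounted
stat /etc/kubernetes/manifests/etcd.manifest
# check which etcd volumes are currently mounted
mount | grep /mnt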

When masters are initially set up for a new cluster, protokube has already mounted the etcd volumes, written the manifests, and symlinked them under /etc/kubernetes/manifests before the kubelet service is started.

On reboot, protokube remounts the volumes, but since the kubelet service is already installed and enabled, kubelet can start before the volumes have been mounted. It then fails to resolve the symlinks and does not start one or both of the etcd pods, depending on which volumes happen to be mounted at that instant.
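
One way to serialize things by hand would be a systemd drop-in that holds kubelet back until the manifest symlinks resolve. This is only a sketch under the assumptions that kubelet runs as a stock systemd unit and that both manifests are expected on every master; an indefinite wait would eventually hit the unit's start timeout, so treat it as illustrative rather than production-ready:

# Hypothetical drop-in: hold kubelet until protokube has mounted the
# etcd volumes and the manifest symlinks resolve.
mkdir -p /etc/systemd/system/kubelet.service.d
cat >/etc/systemd/system/kubelet.service.d/10-wait-etcd-manifests.conf <<'EOF'
[Service]
ExecStartPre=/bin/sh -c 'until stat /etc/kubernetes/manifests/etcd.manifest /etc/kubernetes/manifests/etcd-events.manifest; do sleep 2; done'
EOF
systemctl daemon-reload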

It also seems that there is no filesystem event after that point that would cause the kubelet to rediscover the manifests and start the missing etcd pods.

A simple solution might be to configure the kubelet service not to autostart on boot, since protokube will start it anyway.
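
A minimal sketch of that workaround, assuming kubelet is a standard systemd unit named kubelet.service and that protokube does the equivalent of systemctl start once the volumes are mounted and the manifests written:

# Hypothetical workaround: drop kubelet from the boot sequence and let
# protokube start it once the etcd volumes are in place.
systemctl disable kubelet.service
# confirm it will no longer autostart
systemctl is-enabled kubelet.service   # expected output: disabled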
