We saw the following in the logs when the cluster could not recover on its own, or even start cleanly, after all etcd pods were killed:
level=warning msg="all etcd pods are dead." cluster-name=etcd-cluster cluster-namespace=default pkg=cluster
This situation is not recovered by etcd-operator:
https://github.com/coreos/etcd-operator/blob/8347d27afa18b6c76d4a8bb85ad56a2e60927018/pkg/cluster/cluster.go#L248-L252
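For context, the behaviour at the linked lines amounts to roughly the following (a paraphrased, self-contained sketch with hypothetical names, not the actual operator source; see the permalink above for the real code):

```go
package main

import (
	"errors"
	"fmt"
)

// Rough illustration (hypothetical types, not etcd-operator code) of why the
// cluster never comes back: once zero member pods are running, reconciliation
// gives up instead of re-seeding a member, so the cluster stays dead until
// someone intervenes manually.
type cluster struct {
	runningPods int
	phase       string
}

var errLostQuorum = errors.New("all etcd pods are dead")

func (c *cluster) reconcile() error {
	if c.runningPods == 0 {
		c.phase = "Failed" // the operator stops managing the cluster here
		return errLostQuorum
	}
	// ... normal reconciliation would continue here ...
	return nil
}

func main() {
	c := &cluster{runningPods: 0}
	if err := c.reconcile(); err != nil {
		fmt.Printf("phase=%s err=%v\n", c.phase, err)
	}
}
```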
Researching further, it looks like there are quite a few cases where etcd-operator cannot recover on its own:
- Fail the cluster when all etcd pods are dead and there is no way to recover. coreos/etcd-operator#1973
- How can the operator recover from self-hosted cluster disasters? coreos/etcd-operator#1559
- etcd-operator does not recover an etcd cluster if it loses quorum coreos/etcd-operator#1972
- EtcdCluster condition is incorrect coreos/etcd-operator#2044
Since this backend is only needed for short-lived coordination locks, should we consider switching to Redis, or even a single-instance etcd as it was before (#52)? A sketch of a Redis-based lock follows below.
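If Redis were used for these locks, a minimal sketch (Go with go-redis, hypothetical key and owner names, not existing code in this repo) could look like this; `SET NX` with a TTL gives a short-lived lock that self-expires if the holder dies:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// acquireLock takes a short-lived lock by setting the key only if it does not
// already exist (SET NX) with a TTL, so a crashed holder cannot block others forever.
func acquireLock(ctx context.Context, rdb *redis.Client, key, owner string, ttl time.Duration) (bool, error) {
	return rdb.SetNX(ctx, key, owner, ttl).Result()
}

// releaseLock deletes the key only if we still own it. For brevity this is a
// read-then-delete; a production version would do it atomically in a Lua script.
func releaseLock(ctx context.Context, rdb *redis.Client, key, owner string) error {
	val, err := rdb.Get(ctx, key).Result()
	if err == redis.Nil || (err == nil && val != owner) {
		return nil // lock already expired or was taken over by someone else
	}
	if err != nil {
		return err
	}
	return rdb.Del(ctx, key).Err()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	ok, err := acquireLock(ctx, rdb, "coordination-lock", "worker-1", 30*time.Second)
	if err != nil || !ok {
		fmt.Println("lock is held elsewhere or Redis is unavailable:", err)
		return
	}
	defer releaseLock(ctx, rdb, "coordination-lock", "worker-1")

	fmt.Println("lock acquired; doing short-lived coordinated work")
}
```

Because the locks are short-lived and expendable, losing them on a Redis restart would be acceptable, which avoids the quorum-recovery problems described above.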