Description
We have a 3-node etcd cluster that we used as the backend for a Kubernetes cluster. On one of the nodes, the data is inconsistent with the others:
Member list
etcdctl member list
76c74df0105143e4, started, etcd1, https://172.30.171.85:2380, https://172.30.171.85:2379
b4a97ffa7975df71, started, etcd2, https://172.30.173.252:2380, https://172.30.173.252:2379
bba515b5b42ffb5c, started, etcd0, https://172.30.167.81:2380, https://172.30.167.81:2379
Status
etcdctl endpoint status
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 1.2 GB, false, 2, 3115003
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 1.2 GB, true, 2, 3115003
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 851 MB, false, 2, 3115003
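The status lines above can also be compared mechanically. The following is a minimal Python sketch, not something that was run against the cluster. It assumes the simple output format shown above, with fields in the order endpoint, member ID, version, DB size, leader flag, raft term, raft index. The names `parse_status`, `raft_in_sync` and `size_outliers` are hypothetical helpers introduced for illustration.

```python
from collections import Counter
from typing import List, NamedTuple


class EndpointStatus(NamedTuple):
    endpoint: str
    member_id: str
    version: str
    db_size: str      # kept as the human-readable string, e.g. "1.2 GB"
    is_leader: bool
    raft_term: int
    raft_index: int


# Output copied from the "Status" section of this report.
SAMPLE = """
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 1.2 GB, false, 2, 3115003
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 1.2 GB, true, 2, 3115003
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 851 MB, false, 2, 3115003
"""


def parse_status(output: str) -> List[EndpointStatus]:
    """Parse comma-separated status lines (assumed 7 fields per line)."""
    rows = []
    for line in output.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 7:
            continue  # skip anything that does not match the assumed layout
        ep, member_id, version, size, leader, term, index = fields
        rows.append(EndpointStatus(ep, member_id, version, size,
                                   leader == "true", int(term), int(index)))
    return rows


def raft_in_sync(rows: List[EndpointStatus]) -> bool:
    """True if every member reports the same raft term and raft index."""
    return len({(r.raft_term, r.raft_index) for r in rows}) <= 1


def size_outliers(rows: List[EndpointStatus]) -> List[str]:
    """Endpoints whose reported DB size differs from the most common value."""
    common, _ = Counter(r.db_size for r in rows).most_common(1)[0]
    return [r.endpoint for r in rows if r.db_size != common]
```

On the output above, all three members report the same raft term and index, while `size_outliers` singles out https://172.30.173.252:2379 (851 MB versus 1.2 GB). A size difference alone is only a hint, since DB size also depends on when each member was last defragmented.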
Data inconsistency
OK Node
etcdctl --endpoints https://172.30.167.81:2379 get --prefix --keys-only /registry/deployments/datadog/datadog-agent-kube-state-metrics --consistency="l"
/registry/deployments/datadog/datadog-agent-kube-state-metrics
Inconsistent Node: key is missing
etcdctl --endpoints https://172.30.173.252:2379 get --prefix --keys-only /registry/deployments/datadog/datadog-agent-kube-state-metrics --consistency="l"
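To check more than one key, the same per-endpoint query can be repeated for a whole prefix and the results diffed. This is a rough Python sketch under stated assumptions: `etcdctl` is on the PATH, the environment (API version and TLS settings) is the same as for the commands above, and `list_keys` and `missing_keys` are hypothetical helper names. It uses the same flags as the commands in this report.

```python
import subprocess
from typing import Dict, Set


def list_keys(endpoint: str, prefix: str) -> Set[str]:
    """Run the etcdctl query used above against one endpoint.

    --consistency=l requests a serializable (local) read, so the answer
    comes from that member's own store rather than going through the leader.
    """
    result = subprocess.run(
        ["etcdctl", "--endpoints", endpoint, "get", "--prefix",
         "--keys-only", prefix, "--consistency=l"],
        check=True, capture_output=True, text=True,
    )
    # Drop blank lines so only key names remain.
    return {line for line in result.stdout.splitlines() if line.strip()}


def missing_keys(keys_by_endpoint: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each endpoint, keys present on some other endpoint but absent here."""
    all_keys = set().union(*keys_by_endpoint.values())
    return {
        ep: all_keys - keys
        for ep, keys in keys_by_endpoint.items()
        if all_keys - keys
    }
```

Calling `missing_keys({ep: list_keys(ep, "/registry/deployments/") for ep in endpoints})` with the three client URLs from the member list would return, per endpoint, the keys it lacks relative to the union of all members. Because writes may land between the individual reads, a key reported as missing should be re-checked before drawing conclusions.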
Possible cause
We manage our cluster with Terraform, and we recently upgraded it. The upgrade involved replacing the etcd instances, but we kept the data and WAL directories (on EBS volumes on AWS), and the new nodes had the same IPs and the same etcd version as the original ones. However, etcd was probably not shut down cleanly.
etcd version: We were using a custom build from the 3.2 branch because 3.2.19 had not been released yet and we needed this PR: #9570
Our etcd was built from this commit: https://github.com/roboll/etcd/commit/d45053c068950a5672a22d1192249313dbcbca26 with Go 1.10 (binary available here: https://github.com/roboll/etcd/releases/tag/v3.2.19-datadog). Even though this is not an official release, we believe this should not have happened.
We are keeping the cluster in this state so that we can diagnose what happened, and we are happy to provide more details.