Inconsistent data in an etcd cluster

We have a 3-node etcd cluster that we use as the backend for a Kubernetes cluster. On one of the nodes, the data is inconsistent with the other two:

*Member list*
```
etcdctl member list
76c74df0105143e4, started, etcd1, https://172.30.171.85:2380, https://172.30.171.85:2379
b4a97ffa7975df71, started, etcd2, https://172.30.173.252:2380, https://172.30.173.252:2379
bba515b5b42ffb5c, started, etcd0, https://172.30.167.81:2380, https://172.30.167.81:2379
```

*Status*
```
etcdctl endpoint status
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 1.2 GB, false, 2, 3115003
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 1.2 GB, true, 2, 3115003
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 851 MB, false, 2, 3115003
```
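
To compare members more easily, the status output above can be parsed into records. This is only a sketch: it assumes the seven comma-separated columns are endpoint, member ID, version, DB size, is-leader, raft term and raft index, and the names `MemberStatus`, `parse_status` and `size_outliers` are hypothetical helpers, not part of etcd.

```python
from collections import Counter
from typing import List, NamedTuple

# Output copied from the status section of this report.
SAMPLE = """\
etcdctl endpoint status
https://172.30.167.81:2379, bba515b5b42ffb5c, 3.2.18+git, 1.2 GB, false, 2, 3115003
https://172.30.171.85:2379, 76c74df0105143e4, 3.2.18+git, 1.2 GB, true, 2, 3115003
https://172.30.173.252:2379, b4a97ffa7975df71, 3.2.18+git, 851 MB, false, 2, 3115003
"""


class MemberStatus(NamedTuple):
    endpoint: str
    member_id: str
    version: str
    db_size: str
    is_leader: bool
    raft_term: int
    raft_index: int


def parse_status(output: str) -> List[MemberStatus]:
    """Parse the plain-text status rows (assumed column order, see above)."""
    rows = []
    for line in output.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 7:
            continue  # skips blank lines and the echoed command
        endpoint, member_id, version, db_size, leader, term, index = fields
        rows.append(MemberStatus(endpoint, member_id, version, db_size,
                                 leader == "true", int(term), int(index)))
    return rows


def size_outliers(rows: List[MemberStatus]) -> List[MemberStatus]:
    """Return members whose reported DB size differs from the most common one."""
    if not rows:
        return []
    common = Counter(r.db_size for r in rows).most_common(1)[0][0]
    return [r for r in rows if r.db_size != common]


if __name__ == "__main__":
    for member in size_outliers(parse_status(SAMPLE)):
        print(member.endpoint, member.db_size)
```

On the sample above, all three members report the same raft term and index, and only the member at 172.30.173.252 reports a different DB size. A size difference alone is not proof of divergence (fragmentation can also differ between members), so it is only a hint about where to look.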

*Data inconsistency*
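OK node: the key is present (`--consistency="l"` requests a serializable read, which is answered from the queried member's local store)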
```
etcdctl --endpoints https://172.30.167.81:2379 get --prefix --keys-only /registry/deployments/datadog/datadog-agent-kube-state-metrics --consistency="l"
/registry/deployments/datadog/datadog-agent-kube-state-metrics
```
Inconsistent node: the key is missing
```
etcdctl --endpoints https://172.30.173.252:2379 get --prefix --keys-only /registry/deployments/datadog/datadog-agent-kube-state-metrics --consistency="l"
```
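
The per-key check above could be extended to a whole prefix by listing keys on each member and diffing the sets. The following is a minimal sketch, not a vetted tool: `list_keys`, `parse_keys` and `missing_keys` are hypothetical helper names, the script assumes `etcdctl` is on the `PATH` and that TLS settings are already supplied through the environment, and it sets `ETCDCTL_API=3` because the 3.2 client defaults to the v2 API.

```python
import os
import subprocess
from typing import List, Set


def parse_keys(output: str) -> Set[str]:
    """Turn `get --keys-only` output into a set of keys, ignoring blank lines."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def list_keys(endpoint: str, prefix: str) -> Set[str]:
    """List keys under a prefix on one member using a serializable (local) read."""
    env = dict(os.environ, ETCDCTL_API="3")
    result = subprocess.run(
        ["etcdctl", "--endpoints", endpoint, "get", "--prefix", "--keys-only",
         prefix, "--consistency=l"],
        check=True, capture_output=True, text=True, env=env,
    )
    return parse_keys(result.stdout)


def missing_keys(reference: Set[str], candidate: Set[str]) -> List[str]:
    """Keys present on the reference member but absent on the candidate."""
    return sorted(reference - candidate)


if __name__ == "__main__":
    prefix = "/registry/deployments/"
    ok = list_keys("https://172.30.167.81:2379", prefix)
    suspect = list_keys("https://172.30.173.252:2379", prefix)
    for key in missing_keys(ok, suspect):
        print("missing on suspect member:", key)
```

On a cluster that is still receiving writes, the two listings are taken at slightly different revisions, so a small difference may be transient. Keys that stay missing across repeated runs are the interesting ones.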

*Possible cause*
We manage our cluster with Terraform and recently upgraded it. The upgrade replaced the etcd instances, but we kept the data and WAL directories (on EBS volumes on AWS), and the new nodes had the same IPs and the same etcd version as the original ones. However, etcd was probably not shut down cleanly.

*etcd version*: We were running a custom build from the 3.2 branch because 3.2.19 had not been released yet and we needed this PR: https://github.com/coreos/etcd/pull/9570
Our etcd was built with Go 1.10 from this commit: https://github.com/roboll/etcd/commit/d45053c068950a5672a22d1192249313dbcbca26 (binary available here: https://github.com/roboll/etcd/releases/tag/v3.2.19-datadog). Even though this is not an official release, we believe this should not have happened.

We are keeping the cluster in this state to be able to diagnose what happened. We are happy to send more details.

Inconsistent data in an etcd cluster #9630

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Inconsistent data in an etcd cluster #9630

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions