Inconsistent revision and data occurs #13766

liuycsd · 2022-03-08T07:28:54Z

I'm running a 3-member etcd cluster in a testing environment,
and a k8s cluster with each kube-apiserver connecting to one etcd server via localhost.

etcd cluster: created and running with v3.5.1
k8s cluster: v1.22

I found a data inconsistent several days ago:
Some keys exists on node x and y but not on node z, and some on z but not on the other two, e.g. different pods list would be returned from different servers.
Some keys have different values between z and the others, and can be updated to another different value via the corresponding etcd endpoint, e.g. the kube-system/kube-controller-manager lease points to different pods on different servers and both pods can successfully update the lease via their corresponding local api-server and etcd.
While other keys, including some new created ones, are consistent.

Checking with etcdctl endpoint status, raft_term, leader, raftIndex and raftAppliedIndex are same, but revision and dbSize does not: revision on node z is smaller than the others (~700000 diff) and both revisions are increasing.

Checking with etcd-dump-logs, it seems all the 3 nodes are receiving the same raft logs.

I'm not sure how and when this happens.
But the nodes might be under loads and may run out of memory or disk spaces sometimes.
Checking and comparing db keys with bbolt keys SNAPSHOT key, and seaching node operating system logs for the revisions near where the difference starts, I found some slow read logs mentions those revisions and dozon of leader elections and switches during those several hours. Besides, the disk space (including the etcd data dir) of node z may be full and the operating system logs was cleared so I don't know what happeds exactly and I'm not sure whether this relates to the issue or not.

The text was updated successfully, but these errors were encountered:

serathius · 2022-03-08T10:53:06Z

cc @ptabor

ptabor · 2022-03-08T12:44:13Z

Was this cluster originally created as 3.5.1 or it was subject to migration from 3.4.x. or 3.3.x ?

serathius · 2022-03-08T13:15:19Z

Hey @liuycsd, can you provide etcd configuration you are running. Specifically it would be useful to know if you use --experimental-corrupt-check-time as would allow early detection.

liuycsd · 2022-03-08T15:23:37Z

Was this cluster originally created as 3.5.1 or it was subject to migration from 3.4.x. or 3.3.x ?

IIRC, it was originally created as 3.5.1.

liuycsd · 2022-03-08T15:24:25Z

Hey @liuycsd, can you provide etcd configuration you are running. Specifically it would be useful to know if you use --experimental-corrupt-check-time as would allow early detection.

I didn't set --experimental-corrupt-check-time.

liuycsd · 2022-03-09T14:58:24Z

Hey @liuycsd, can you provide etcd configuration you are running. Specifically it would be useful to know if you use --experimental-corrupt-check-time as would allow early detection.

Checking the running server, current config is:
https serving and client and peer cert auth enabled
quota-backend-bytes
heartbeat-interval
election-timeout
auto-compaction-retention
All in environment variables and no command line arguments.

chaochn47 · 2022-03-11T18:27:21Z

Hi @liuycsd, thanks for the reporting this issue.

But the nodes might be under loads and may run out of memory or disk spaces sometimes.

I found some slow read logs mentions those revisions and dozon of leader elections and switches during those several hours. Besides, the disk space (including the etcd data dir) of node z may be full and the operating system logs was cleared so I don't know what happeds exactly and I'm not sure whether this relates to the issue or not.

Looks like the data inconsistency triggers are

OOM
disk space full
lot of leader elections and switches during those several hours.

Can you help us better understand your cluster is set up?

I am wondering if you have a monitoring system in place for this etcd cluster so you concluded there were likely out of memory or full disk space issues? If so, can you provide those information?
Do you have periodic backup https://etcd.io/docs/v3.5/op-guide/recovery/#snapshotting-the-keyspace? It can have two benefits.
- Compare two db files before and after the incident across the etcd cluster to further troubleshoot
- Recover to a point-in-time good state where data was consistent

I found some slow read logs mentions those revisions and dozon of leader elections and switches during those several hours.

Any logs will be useful for us to help troubleshoot

Kindly suggest:

Please use --experimental-corrupt-check-time as would allow early detection as @ptabor and @serathius said.
Use latest etcd released version v3.5.2 as it included one important data inconsistency bug fix 55c16df by comparing v3.5.2 and v3.5.1 v3.5.1...v3.5.2

wilsonwang371 · 2022-03-11T22:50:34Z

I suggest you try 3.5.2 and see if this problem can still happen. Thre was a data corruption issue fixed recently.

liuycsd · 2022-03-12T09:37:38Z

Thanks.

That's just a simple environment for testing, with limited resources. No monitoring or backup setup.
Just because someone deployed a heavy system using that k8s cluster and found it quite slow, after logging into those nodes, found it was out of resources and made some manually cleanup and kubectl get pods showed that it seems to works.
But when I'm using that cluster several days later I found the inconsistent and take a snapshot AFTER that. No previous backup found.

I try to reproduce it in another 3.5.1 cluster, but it didn't occur. I setup a low cpu and disk quota to simulate a overloaded system so some leader elections were triggered randomly, at the same time, keep running several etcdctl compare/put txn, but the revisions inconsistent didn't occur.

I may try v3.5.2 and setup some basic monitor and backup later to see whether or not it occurs again.

Please use --experimental-corrupt-check-time as would allow early detection as @ptabor and @serathius said.

Are there any suggestions about what duration should I use for the corrupt-check-time and how can I know when it detects any corruption? I didn't find the detailed document about this.

liuycsd · 2022-03-12T09:39:43Z

Sorry for the accident close

liuycsd · 2022-03-14T14:58:53Z

Thanks @chaochn47 @wilsonwang371 and everyone else.

After checking the log again, I found a panic did occur on node y, and another panic occurred on node x two minutes later. From the restart log after panic, the revision from loaded db file was near (~10000 later than) where the revision difference starts by comparing the db with bbolt keys. I don't know whether or not it panicked on node z because the logs were cleared.
It just panicked at kvsToEvents in mvcc.syncWatchers.

serathius · 2022-03-14T15:08:42Z

Can you share the stacktraces?

liuycsd · 2022-03-14T15:34:46Z

Can you share the stacktraces?

I don't have access to the cluster right now. I may paste it when it's available later.
IIRC, the stacktraces are like #12845.

wilsonwang371 · 2022-03-14T16:07:00Z

can you check if #13505 is the root cause?

liuycsd · 2022-03-16T08:55:54Z

can you check if #13505 is the root cause?

How could I verify that?

The stacktraces are similar.
On node y:

{"level":"panic","ts":"...","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 7)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue
goroutine 173 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00f971800, 0xc01238bb00, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000614870, 0x12268a2, 0x23, 0xc01238bb00, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc000614870, 0xc0000fc720, 0xc00eb85000, 0x13c, 0x200, 0xc00eb82000, 0x13c, 0x200, 0x12, 0x7fffffffffffffff, ...)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc0000fc6e0, 0x0)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc0000fc6e0)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:95 +0x3a7
etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

On node x:

{"level":"panic","ts":"...","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 0)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue
goroutine 209 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00015d5c0, 0xc00841ffc0, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0005e2d20, 0x12268a2, 0x23, 0xc00841ffc0, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc0005e2d20, 0xc00042c360, 0xc0070cf000, 0x14c, 0x200, 0xc0070cc000, 0x14c, 0x200, 0x12, 0x7fffffffffffffff, ...)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc00042c320, 0x0)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc00042c320)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:95 +0x3a7
etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

wilsonwang371 · 2022-03-16T20:28:10Z

You can use 3.5.2 and see if you can reproduce this issue. From the stack trace, it seems like #13505 fixed this issue and it is available in 3.5.2. @liuycsd

wilsonwang371 · 2022-03-22T03:02:25Z

can you check if #13505 is the root cause?

How could I verify that?

The stacktraces are similar.
On node y:

{"level":"panic","ts":"...","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 7)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue
goroutine 173 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00f971800, 0xc01238bb00, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000614870, 0x12268a2, 0x23, 0xc01238bb00, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc000614870, 0xc0000fc720, 0xc00eb85000, 0x13c, 0x200, 0xc00eb82000, 0x13c, 0x200, 0x12, 0x7fffffffffffffff, ...)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc0000fc6e0, 0x0)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc0000fc6e0)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:95 +0x3a7
etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

On node x:

{"level":"panic","ts":"...","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 0)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue
goroutine 209 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00015d5c0, 0xc00841ffc0, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0005e2d20, 0x12268a2, 0x23, 0xc00841ffc0, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc0005e2d20, 0xc00042c360, 0xc0070cf000, 0x14c, 0x200, 0xc0070cc000, 0x14c, 0x200, 0x12, 0x7fffffffffffffff, ...)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc00042c320, 0x0)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc00042c320)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:95 +0x3a7
etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

do you have any update on this ?

serathius · 2022-03-23T11:44:02Z

Please use --experimental-corrupt-check-time as would allow early detection as @ptabor and @serathius said.

Are there any suggestions about what duration should I use for the corrupt-check-time and how can I know when it detects any corruption? I didn't find the detailed document about this.

There is some performance overhead expected fro running it, I would recommend it to run every couple of hours.

serathius · 2022-03-23T11:57:57Z

I'm still investigating the data corruption issues that were reported for v3.5 and were not reproduced. Based on this issue I have been looking if #13505 could be the case of them. I would like to confirm that we can reproduce data corruption on v3.5.0 by triggering case from #13505 to then verify that v3.5.2 has fixed the data corruption.

However based on my discussion with @ptabor, we don't see any how mis-referenced bbolt page could trigger could cause the corruption. It could still be caused by generic segfault, however this makes it even harder to reproduce.

As discussed on last community meeting this problem was found by Bytedance, @wilsonwang371 could you shed some light on how this issue was detected and if it was accompanied by data corruption?

wilsonwang371 · 2022-03-23T17:51:06Z

I'm still investigating the data corruption issues that were reported for v3.5 and were not reproduced. Based on this issue I have been looking if #13505 could be the case of them. I would like to confirm that we can reproduce data corruption on v3.5.0 by triggering case from #13505 to then verify that v3.5.2 has fixed the data corruption.

However based on my discussion with @ptabor, we don't see any how mis-referenced bbolt page could trigger could cause the corruption. It could still be caused by generic segfault, however this makes it even harder to reproduce.

As discussed on last community meeting this problem was found by Bytedance, @wilsonwang371 could you shed some light on how this issue was detected and if it was accompanied by data corruption?

We found data inconsistency in our internal etcd cluster. This was found in 3.4 but this is happening more often after we start using 3.5 which has a better performance.

When this happens, a solution would be take down the error node for some time and bring it back again later. This will force the data resync inside the cluster and the problem will be gone.

However, when this happens, no alert or special message was given. It gives the SRE team a hard time to detect this. Luckily, they were able to dump the log and I found there is a panic within the time range that we found issue. Other than that, there is no other error messages discovered. The content of the panic is very similar to the one described in this thread. Basically, somewhere in kvsToEvents(), a panic was triggered.

With some investigation, we believed the panic should be an error either from reading boltdb or from reading memory. From my point of view, it should be a data corruption inside boltdb.

Later in #13505, we found that the boltdb was accessed without a lock protection. Without a proper lock held, we can corrupt boltdb data and this can definitely trigger a data corruption in etcd. That's why I believe this patch should be able to fix this data corruption issue.

PaulFurtado · 2022-03-24T06:16:21Z

We encountered this corruption in one kube etcd cluster running etcd 3.5.1 yesterday morning and then upgraded all of our clusters to 3.5.2 today because we were going to be doing a rolling-restart of the kube api servers, which causes dangerously high memory spikes on etcd.

We faced etcd OOMs in two of our clusters when performing this rolling update. Both of those etcd clusters were running 3.5.2 and both experienced this corruption. The kube symptom of this is that querying various objects returned different results, probably depending on which etcd server was hit. This resulted in errors when listing objects like:

Error from server: etcdserver: mvcc: required revision is a future revision

and cases where querying the same object multiple times in a row altered between a recent state and one from 10+ minutes before.

On the etcd side, the most obvious symptom was the wildly different database sizes between members.

On the first cluster we had faced this issue on yesterday when still running 3.5.1, we attempted a defrag on the single member that had a different DB size and the defrag repeated failed so we shut the member down and the kube cluster immediately returned to a happy state.
On the two clusters we faced this issue on today that were running 3.5.2, most of the members had different db sizes. Ex: one showing 1.7GB, two showing 1.6GB, two showing 1.8GB. defrags succeeded no problem but did not solve anything. We attempted to use the new --experimental-corrupt-check-time flag on all of them and restart, but it didn't find any corruption, and further defrags continued to succeed but not solve the issue. Only way to solve was to walk through each member and shut them down log enough that they did a full sync on startup.

So, I can confirm that corruption problems definitely still exist in 3.5.2, and that the experimental corruption check flag may need some more work to detect it.

The servers in the first two cases were i3en.2xlarge ec2 instances (8 CPU, 64GiB memory, high iops nvme SSDs). The servers in the final case were i3en.6xlarge (24CPU, 192GiB memory). For an idea of load, every member in that i3en.6xlarge cluster pretty consistently does 8-12gbps of etcd traffic, and the 2xlarge clusters around 3-5gbps. It seems like our kube workload reproduces it after just a couple of etcd OOMs, so a reproducer could possibly just be issuing kill -9 to the etcd members repeatedly while performing a stress test until the db sizes vary widely, although the system behavior under memory pressure could also be relevant (heavily thrashing io causing the server to come to a crawl and leader elections occurring just prior to OOM).

Please let me know if any additional information would be valuable for you here.

serathius · 2022-03-24T09:59:46Z

Thanks @PaulFurtado sharing, this is hugely informative. There are 3 important pieces of information:

Corruption issues are still present in v3.5.2
Corruption check cannot detect them
We know under what circumstances issue can be reproduced.

I will look into reproducing this myself, couple of questions about corruption check. It is possible that corruption check run, but don't report an issue due to high WAL latency of followers. To get more details about why it's not working can you:

Provide some of the corruption check logs. They can be recognized by their source file location be corrupt.go.
Can you try enabling -experimental-initial-corrupt-check. It enables corruption check when member joins existing cluster, this should allow to always check hash difference between members.

Would be also good to know if you have seen same problem with v3.4.X. If you have a environment for reproducing the issue it would be good to confirm that problem only affects v3.5. It would be also useful to eliminate that the problem is with any optional v3.5 feature, can you confirm provide etcd configuration you use?

PaulFurtado · 2022-03-24T21:04:38Z

@serathius sorry, I had mentioned the --experimental-corrupt-check-time flag, but I actually meant the --experimental-initial-corrupt-check flag. When we believed it to be corrupt, we enabled the flag and this is what was logged upon restart:

{"level":"info","ts":"2022-03-23T21:49:02.865Z","caller":"etcdserver/corrupt.go:46","msg":"starting initial corruption check","local-member-id":"d9ccd33e39ffddb","timeout":"7s"}
{"level":"info","ts":"2022-03-23T21:49:06.242Z","caller":"etcdserver/corrupt.go:116","msg":"initial corruption checking passed; no corruption","local-member-id":"d9ccd33e39ffddb"}

We have never experienced this issue with 3.4.x (or if we did, it must have been extremely subtle and never caused us pain). Under 3.4.x we experienced OOM kills constantly, and had faced so many outages due to OOMs so I would expect that we definitely would have noticed it on 3.4.x. We have been running on 3.5.x for a bit over a month now and this is the first time we hit an OOM condition and we immediately hit this corruption, so it definitely seems like a 3.5 issue.

These are the exact flags from one of the i3en.6xlarge servers:

--logger zap
--log-outputs stderr
--log-level info
--listen-peer-urls https://0.0.0.0:2380
--listen-client-urls https://0.0.0.0:2379
--advertise-client-urls https://172.16.60.166:2379
--data-dir /usr/share/hubspot/etcd/data
--peer-trusted-ca-file /etc/pki/tls/certs/etcd-peer-ca.crt
--peer-cert-file /etc/pki/tls/certs/etcd-peer.crt
--peer-key-file /etc/pki/tls/private/etcd-peer.key
--peer-client-cert-auth
--trusted-ca-file /etc/pki/tls/certs/etcd-client-ca.crt
--client-cert-auth
--cert-file /etc/pki/tls/certs/etcd-client.crt
--key-file /etc/pki/tls/private/etcd-client.key
--auto-compaction-retention 1h
--heartbeat-interval 100
--election-timeout 15000
--quota-backend-bytes 17179869184
--snapshot-count 100000
--max-request-bytes 15728640
--enable-v2
--experimental-peer-skip-client-san-verification

I also doubt that it's relevant, but we run with these env vars as well:

ETCD_NAME=na1-brave-street
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_CLUSTER=default=http://localhost:2380
ETCD_INITIAL_CLUSTER_TOKEN=kube-iad02
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://172.16.60.166:2380
GODEBUG=x509ignoreCN=0

I will also mention that we are running kubernetes 1.21 and we are aware that it is intended to talk to an etcd 3.4.x cluster. However, we upgraded to etcd 3.5 because it is impossible to run etcd with stable latency and memory usage on 3.4.x with kube clusters of our size. Is it possible that something about the 3.4.x client could be corrupting the 3.5.x server?

chaochn47 · 2022-03-25T06:47:51Z

I am not able to reproduce it with a 3 distributed etcd node cluster backing a kubernetes 1.22 cluster.

The reproduction set up is generating load from a standby node using benchmark put, range and txn-mixed. The 3 node keeps oom_reaped, recovered from the local disk and the revision is monotonically increasing across the cluster. The revision divergence is minimal and slow nodes will catch up eventually.

@PaulFurtado

Would you mind check one more time if creating 1.21 cluster with fresh v3.5.2 etcd cluster can reproduce the issue?
Does my reproduction mentioned above miss anything from your observation?

Thanks!

On my end, I will try to rerun the reproduction process with etcd cluster upgrading/downgrading between 3.5 and 3.4.

serathius · 2022-03-25T10:49:44Z

I think reproduced the issue, it was surprisingly easy when I knew where to look. Based on @PaulFurtado comment I looked into simulating highly stressed etcd and sending SIGKILL signal to members one by one. When looking into our functional tests I found it very strange that we don't already have such test. We have tests with SIGTERM_* and SIGQUIT_AND_REMOVE_FOLLOWER, however we don't just test if database is correctly restored after unrecoverable error.

I have added a new tests (SIGKILL_FOLLOWER, SIGKILL_LEADER) and increased the stress-qps. This was enough to cause data inconsistency. As functional tests run with --experimental-initial-corrupt-check, killed member fails to join with message checkInitialHashKV failed. This doesn't answer question how in @PaulFurtado case this check was not triggered, but should be enough to show there is a problem.

To make results repeatable I have modified functional tests to inject the failure repeatably for some time. I managed to get 100% chance of reproduction for both test scenarios with 8000 qps within 1 minute of running. Issue seems to be only happen in higher qps scenarios, with lower chance of reproduction with 4000 qps and no reproductions with 2000 qps.

With that I looked into using same method to test v3.4.18 release. I didn't managed to get any corruptions even when running for 10 minutes with 8000 qps. I didn't test with higher qps as this is limit of my workstation, however this should be enough to confirm that issue is only on v3.5.X.

I'm sharing the code I used for reproduction here #13838. I will be looking to root causing the data inconsistency first, and later redesigning our functional tests as they seem to be not fulfilling their function.

NOTE: This new version contains the fix for etcd-io/etcd#13766 so we no longer need the `--experimental-initial-corrupt-check=true` flag

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 8, 2022

serathius added the type/bug label Mar 8, 2022

liuycsd closed this as completed Mar 12, 2022

liuycsd reopened this Mar 12, 2022

ahrtr mentioned this issue Mar 23, 2022

Data inconsistency in etcd version 3.3.11 #13503

Closed

This was referenced Apr 15, 2022

[release-1.22] Bump etcd to v3.5.3 rancher/rke2#2783

Closed

Bump etcd to v3.5.3 k3s-io/k3s#5449

Closed

[release-1.22] bump etcd to v3.5.3 k3s-io/k3s#5450

Closed

stevo-f3 mentioned this issue Apr 28, 2022

etcd: Update to v3.5.4 kubernetes/kubernetes#109707

Closed

tsde mentioned this issue May 3, 2022

Update to RKE 1.3.11 rancher/terraform-provider-rke#340

Merged

brandond mentioned this issue May 4, 2022

Inconsistent revision and data after --force-new-cluster #14009

Closed

a-blender mentioned this issue May 19, 2022

Update Terraform Provider RKE to 1.3.11 rancher/terraform-provider-rke#342

Closed

TeddyAndrieux added a commit to scality/metalk8s that referenced this issue Jul 27, 2022

build: Bump etcd version to 3.5.3-0

ce8c7ab

NOTE: This new version contains the fix for etcd-io/etcd#13766 so we no longer need the `--experimental-initial-corrupt-check=true` flag

TeddyAndrieux added a commit to scality/metalk8s that referenced this issue Jul 27, 2022

build: Bump etcd version to 3.5.3-0

783f8a3

NOTE: This new version contains the fix for etcd-io/etcd#13766 so we no longer need the `--experimental-initial-corrupt-check=true` flag

TeddyAndrieux mentioned this issue Jul 27, 2022

Bump Kubernetes to 1.24.3 and etcd to 3.5.3-0 scality/metalk8s#3832

Merged

ahrtr mentioned this issue Aug 3, 2022

CHANGELOG: update version recommendation #14306

Merged

serathius mentioned this issue Aug 22, 2022

Introduce etcd linearizability tests #14045

Closed

33 tasks

serathius mentioned this issue Oct 14, 2022

Validate etcd linearizability #14398

Merged

serathius mentioned this issue Dec 25, 2022

Reproduce issue 13766 in linearizability tests #14682

Merged

serathius added a commit to serathius/etcd that referenced this issue Jan 13, 2023

tests: Add reproduce etcd-io#13766 scenario

ec61d58

serathius added a commit to serathius/etcd that referenced this issue Jan 13, 2023

tests: Add reproduce etcd-io#13766 scenario

4476f83

serathius added a commit to serathius/etcd that referenced this issue Jan 13, 2023

tests: Add reproduce etcd-io#13766 scenario

dd3b5f2

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius added a commit to serathius/etcd that referenced this issue Jan 16, 2023

tests: Add reproduce etcd-io#13766 scenario

0142978

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius added a commit to serathius/etcd that referenced this issue Jan 16, 2023

tests: Add reproduce etcd-io#13766 scenario

bb36751

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius added a commit to serathius/etcd that referenced this issue Jan 16, 2023

tests: Add reproduce etcd-io#13766 scenario

245d03f

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius added a commit to serathius/etcd that referenced this issue Jan 17, 2023

tests: Add reproduce etcd-io#13766 scenario

a52b29f

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius added a commit to serathius/etcd that referenced this issue Jan 17, 2023

tests: Add reproduce etcd-io#13766 scenario

287e1b6

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius added a commit to serathius/etcd that referenced this issue Jan 17, 2023

tests: Add reproduce etcd-io#13766 scenario

ca9de27

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius added a commit to serathius/etcd that referenced this issue Jan 17, 2023

tests: Add reproduce etcd-io#13766 scenario

a0d12d3

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

dims mentioned this issue Feb 24, 2023

possible Issues(?) bumping etcd 8 GB upper limit (or not!) #15354

Closed

lavacat mentioned this issue Apr 15, 2023

etcd discovery fails #15705

Closed

tjungblu mentioned this issue Jun 20, 2023

One of three nodes revision inconsistency cause the node compaction abnormal #16106

Closed

4 tasks

dberardo-com mentioned this issue Jul 11, 2023

Restore cluster with embedded etcd datastore snapshot failed k3s-io/k3s#5334

Closed

ahrtr mentioned this issue Oct 12, 2023

Get rid of OnPreCommitUnsafe and LockInsideApply & LockOutsideApply #16749

Open

serathius mentioned this issue Apr 11, 2024

Make robustness qps requirements less fragile to CI performance #17775

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent revision and data occurs #13766

Inconsistent revision and data occurs #13766

liuycsd commented Mar 8, 2022 •

edited

Loading

serathius commented Mar 8, 2022

ptabor commented Mar 8, 2022

serathius commented Mar 8, 2022

liuycsd commented Mar 8, 2022

liuycsd commented Mar 8, 2022 •

edited

Loading

liuycsd commented Mar 9, 2022 •

edited

Loading

chaochn47 commented Mar 11, 2022 •

edited

Loading

wilsonwang371 commented Mar 11, 2022

liuycsd commented Mar 12, 2022

liuycsd commented Mar 12, 2022

liuycsd commented Mar 14, 2022

serathius commented Mar 14, 2022

liuycsd commented Mar 14, 2022

wilsonwang371 commented Mar 14, 2022

liuycsd commented Mar 16, 2022

wilsonwang371 commented Mar 16, 2022 •

edited

Loading

wilsonwang371 commented Mar 22, 2022

serathius commented Mar 23, 2022

serathius commented Mar 23, 2022

wilsonwang371 commented Mar 23, 2022

PaulFurtado commented Mar 24, 2022

serathius commented Mar 24, 2022 •

edited

Loading

PaulFurtado commented Mar 24, 2022

chaochn47 commented Mar 25, 2022 •

edited

Loading

serathius commented Mar 25, 2022 •

edited

Loading

Inconsistent revision and data occurs #13766

Inconsistent revision and data occurs #13766

Comments

liuycsd commented Mar 8, 2022 • edited Loading

serathius commented Mar 8, 2022

ptabor commented Mar 8, 2022

serathius commented Mar 8, 2022

liuycsd commented Mar 8, 2022

liuycsd commented Mar 8, 2022 • edited Loading

liuycsd commented Mar 9, 2022 • edited Loading

chaochn47 commented Mar 11, 2022 • edited Loading

wilsonwang371 commented Mar 11, 2022

liuycsd commented Mar 12, 2022

liuycsd commented Mar 12, 2022

liuycsd commented Mar 14, 2022

serathius commented Mar 14, 2022

liuycsd commented Mar 14, 2022

wilsonwang371 commented Mar 14, 2022

liuycsd commented Mar 16, 2022

wilsonwang371 commented Mar 16, 2022 • edited Loading

wilsonwang371 commented Mar 22, 2022

serathius commented Mar 23, 2022

serathius commented Mar 23, 2022

wilsonwang371 commented Mar 23, 2022

PaulFurtado commented Mar 24, 2022

serathius commented Mar 24, 2022 • edited Loading

PaulFurtado commented Mar 24, 2022

chaochn47 commented Mar 25, 2022 • edited Loading

serathius commented Mar 25, 2022 • edited Loading

liuycsd commented Mar 8, 2022 •

edited

Loading

liuycsd commented Mar 8, 2022 •

edited

Loading

liuycsd commented Mar 9, 2022 •

edited

Loading

chaochn47 commented Mar 11, 2022 •

edited

Loading

wilsonwang371 commented Mar 16, 2022 •

edited

Loading

serathius commented Mar 24, 2022 •

edited

Loading

chaochn47 commented Mar 25, 2022 •

edited

Loading

serathius commented Mar 25, 2022 •

edited

Loading