Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All leases are revoked when the etcd leader is stuck in handling raft Ready due to slow fdatasync or high CPU. #15247

Closed
aaronjzhang opened this issue Feb 6, 2023 · 44 comments · Fixed by #16822
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. stage/tracked type/bug

Comments

@aaronjzhang
Copy link

aaronjzhang commented Feb 6, 2023

What happened?

We have 3 etcd nodes which are running on k8s, the etcd pod use portworx storage for db saving, we've found when rebooting the portworx node, it will cause etcd to be stuck in fdatasync, if this etcd node is leader, then it will cause below 2 lease keep alive issues.

  1. the old leader still revokes lease after it steped to follower, this is the same issue as #12528

some clients connected to other etcd nodes and send keep alive to them, after the old leader steped to follower, the new leader will handle the lease keep alive, but the old leader's lessor still considered itself is the primary and revoked all leases after their ttl expired in its view, after the old leader returned from fdatasync, these revoke request will be still sent out when handling the next Ready.

  1. the old leader still handle lease keep alive request from the connected client.

some clients connected to the old leader etcd node, when the old leader was stuck in fdatasync, the raft layer steped to follower, but the lessor was not demoted, the connected clients can still send keep alive request to the old leader, at the same time the other two etcd nodes elected a new leader and didn't receive the lease keep alive message which should be sent to them but they were still sent to old leader, so these leases were revoked after the ttl expired.

Essentially the above 2 issues are because of the lessor primary is not synced up with the raft leader, there is big gap between lessor primary and raft leader when etcd server is stuck in processing raft Ready.

What did you expect to happen?

  1. The lease which is kept alive on old leader should return an error when its raft stepped to follower so that the clients can connect to other etcd nodes.
  2. The old leader should not revoke the other leases.

How can we reproduce it (as minimally and precisely as possible)?

  1. Add a piece of codes in func (w *WAL) sync() function to simulate fdatasync stuck
func (w *WAL) sync() error {
	if w.encoder != nil {
		if err := w.encoder.flush(); err != nil {
			return err
		}
	}
	start := time.Now()
	err := fileutil.Fdatasync(w.tail().File)

    //Simulate fdatasync is stuck
	if fileutil.Exist("/run/etcd/fdatasync_block") {
		f, err := ioutil.ReadFile("/run/etcd/fdatasync_block")
		if err == nil {
			str_num := strings.Trim(string(f), "\n")
			block_duration, err := strconv.Atoi(str_num)
			if err == nil {
				time.Sleep(time.Duration(block_duration) * time.Second)
			} else {
				if w.lg != nil {
					w.lg.Warn("Atoi error", zap.Error(err))
				}
			}
		}
	}

	took := time.Since(start)
	if took > warnSyncDuration {
		if w.lg != nil {
			w.lg.Warn(
				"slow fdatasync",
				zap.Duration("took", took),
				zap.Duration("expected-duration", warnSyncDuration),
			)
		} else {
			plog.Warningf("sync duration of %v, expected less than %v", took, warnSyncDuration)
		}
	}
	walFsyncSec.Observe(took.Seconds())

	return err
}
  1. Compile etcd with above changes.
  2. Apply the etcd patch in a cluster and restart them.
  3. Grant 3 leases with 10 seconds ttl.
  4. Keep different lease alive on different etcd node.
  5. Start to simulate fdatasync blocking on leader etcd with below commands:
mkdir /run/etcd; echo 60 > /run/etcd/fdatasync_block
  1. Put a kv to trigger fdatasync.
  2. Wait for etcd leader changing.
  3. Check the time to live of the lease on the new etcd leader which is kept alive on the old leader, you will see it's expired but the keeping alive is on going on the old leader.
  4. Wait for about 60 seconds, the other two leases will be revoked after the old leader is recovered.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.4.3
Git SHA: 3cf2f69b5
Go Version: go1.14
Go OS/Arch: linux/amd64




$ etcdctl version
etcdctl version: 3.4.3
API version: 3.4

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

Solution and analysis (updated by @ahrtr)

@ahrtr ahrtr added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 7, 2023
@ahrtr
Copy link
Member

ahrtr commented Feb 7, 2023

Thanks @aaronjzhang for raising this issue.

It looks like a major design issue to me. When the etcdserver is being blocked on the fdatasync, there is no chance for raft to notify etcdserver the leader changing.

I do not get time to deep dive into the new feature raft: support asynchronous storage writes, my immediate feeling is that it should can resolve such issue. @tbg @nvanbenschoten Could you double confirm this? But we also need to resolve it for etcd 3.5, which still depends on the old raft package.

Note you can reproduce this issue using the failpoint walBeforeSync or walAfterSync added in #15252. The steps to reproduce this issue are as below,

$ make gofail-enable
$ make build
$ goreman start   # Use the Procfile below
$ etcdctl lease grant 20
$ etcdctl lease keep-alive 3269862a9fd99a09  # Run this command in another terminal. The leaseid is returned in above command
$ curl http://127.0.0.1:2003/raftBeforeSave -XPUT -d'sleep("100s")'  # Note that you need to enable failpoint on the leader, assuming etcd3 is the leader
$ etcdctl put k1 v1   # wait about 100s, then you will see that the lease will be revoked

Procfile

etcd1: GOFAIL_HTTP="127.0.0.1:2001" bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr
etcd2: GOFAIL_HTTP="127.0.0.1:2002" bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr
etcd3: GOFAIL_HTTP="127.0.0.1:2003" bin/etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr

@ahrtr
Copy link
Member

ahrtr commented Feb 7, 2023

Run make gofail-disable to cleanup the codebase.

@tbg
Copy link
Contributor

tbg commented Feb 9, 2023

do not get time to deep dive into the new feature etcd-io/raft#8, my immediate feeling is that it should can resolve such issue. @tbg @nvanbenschoten Could you double confirm this? But we also need to resolve it for etcd 3.5, which still depends on the old raft package.

It could help be more resilient to disk stalls, but it wasn't designed to do just that, and so this is just speculation. In CockroachDB we only use async log appends, but the log application is still in the same loop, so we don't have a ton of experience there either. Plus, we haven't shipped it. I don't understand exactly what the problem in etcd is that is described above, but it's probably similar to issues we have in CockroachDB when the disk stalls and turns into a grey failure. What we do there is detect disk stalls and fatal the process. We don't try to handle it gracefully.

@ahrtr
Copy link
Member

ahrtr commented Feb 9, 2023

What we do there is detect disk stalls and fatal the process.

Thanks @tbg . This seems to be the key point. Could you be a little more specific about this?

@tbg
Copy link
Contributor

tbg commented Feb 9, 2023

Sure - here's a small tour. I hope I'm not missing anything but basically

  • all logging is fallible: buffered and when it can't flush in time, exit the process1. This includes network logging (fluentd). We have observed grey failures due to all of these stalling.
  • our storage engine pebble detects stalled writes and exits the process2.
  • additionally, in the liveness layer in CRDB, we also write to our disks before extending a lease3; this is so that we are a bit more snappy in detecting badness but doesn't fundamentally change the behavior.
  • we perform end-to-end testing of disk stalls45.

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/5024b24907e9782b2c3b8eb9e097467b2570c671/pkg/util/log/file.go#L251-L268

  2. https://github.com/cockroachdb/cockroach/blob/844a370ceced8e02fd5293a8854de964edae0794/pkg/storage/pebble.go#L1124-L1153

  3. https://github.com/cockroachdb/cockroach/blob/a28aa6cbb4b0de3ff92b4dcc74e319d2e2a220c2/pkg/kv/kvserver/liveness/liveness.go#L1278-L1295

  4. https://github.com/cockroachdb/cockroach/blob/6ed2fb0f1e98b0bdbf6d0f100bef2c1bf5dbf6de/pkg/cmd/roachtest/tests/disk_stall.go

  5. https://github.com/cockroachdb/cockroach/blob/eec5a47f4762922674feaeac74666e4d63d2a247/pkg/cmd/roachtest/tests/failover.go#L631-L638

@ahrtr
Copy link
Member

ahrtr commented Feb 10, 2023

Thank you @tbg ! Will take a look later.

@chaochn47
Copy link
Member

#13527 is related.

@ahrtr
Copy link
Member

ahrtr commented Feb 22, 2023

Thanks @tbg for the input.

When there is something wrong on the storage, etcd may get stuck on syncing data on the storage files. When running into this situation, the etcd member may not be able to process any requests, even become unresponsive when it is being removed from the cluster, e.g issues/14338. If the blocked member is the leader, it may cause two issues,

  • Users may get two leaders, but actually there is only one valid leader. The old leader is getting stuck on stalled writes, so the raft has no chance to notify it that the leader has changed. See 14338#issuecomment-1216006545.
  • The old leader may unexpectedly revoke all leases when it recovers from stalled writes, and accordingly remove all keys bound to the leases. See issues/15247.

I am thinking to propose to terminate etcd process when it doesn’t finish syncing the data (WAL or boltdb) in a configured max duration (we can define two thresholds of duration, one for WAL and the other for boltDB). When the member starts again, it will not be elected as the leader anymore if the storage issue isn’t resolved, so the two issues mentioned above will not happen.

any immediate comments or concerns?
@mitake @ptabor @serathius @spzala @tbg
@liggitt @dims @lavalamp @wojtek-t @nikhita @neolit123 @chaochn47 @tjungblu @fuweid

If it looks overall good, then I will provide a detailed design doc.

@chaochn47
Copy link
Member

chaochn47 commented Feb 22, 2023

+1 on the proposal

The issue we saw was kube-apiserver watch still connects to the old member which was stuck in disk write and serving with no updates. Terminating the "stuck in disk writing" etcd process is the right thing to do IMO. It is now fulfilled by nanny agent but could have been done in etcd.

For data consistency risks:

  1. boltdb is less concerning as long as consistent_index did not persist before terminating. WAL replay will reapply to boltdb after restart.
  2. Not sure what's the side effect of "meant to be persisted" WAL entries are lost especially hard state (commit index). I think the safest way to mitigate this uncertainty is to fail fast after restart. This could be achieved by removing this "bad" member permanently.

In other hand, linearizability test already has equivalent failure injection point raftBeforeSave and beforeCommit to verify correctness.

@ahrtr
Copy link
Member

ahrtr commented Feb 23, 2023

I don't see any data consistency issue. I will clearly clarify & explain the details in the design doc.

Note that it's a critical issue if we see any data inconsistency in whatever situation, unless the storage hardware is broken.

@fuweid
Copy link
Member

fuweid commented Feb 23, 2023

+1 for the proposal. Let the old leader steps down and raises a alert. The admin operator can notice the cluster is not health, even if the old leader process is in D status and waiting for pid reaper.

@mitake
Copy link
Contributor

mitake commented Feb 23, 2023

+1 for @ahrtr 's proposal. I think the root cause is that some part of lease related state transition isn't controlled by Raft, so the old leader still handle lease keep alive request from the connected client. can happen I think. This is a related discussion: #7320 (note that the related behavior is quite subtle, I’m still checking). It might be valuable to move the entire state transition to Raft layer?

@ahrtr
Copy link
Member

ahrtr commented Feb 27, 2023

Thanks all for the feedback.

The first draft design doc: https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit#

@tjungblu
Copy link
Contributor

tjungblu commented Mar 3, 2023

Great proposal, I see there's a lot of comments in the doc around the specificity of this problem. I agree with @serathius though, we should look at this more from a general failure PoV.

If it's not disk stalls, it's a slow network. Therefore I do like the split between detection and action (or remediation). Clearly we don't need to implement all of it to fix the issue at hand, but having something more generic in place will allow us to add more over time. Just to summarize some strategies that were proposed here or in the doc:

Stall Detection

Stall Remediation

  • backoff
  • step down leader, wait for next election result
  • fatal exit (supervised via liveness probes / process suicide)

@mitake
Copy link
Contributor

mitake commented Apr 16, 2023

I thought about another lightweight mitigation for the issue: let LeaseRevokeRequest have a field of Raft term when it was issued. This branch has a rough implementation: https://github.com/mitake/etcd/commits/lease-revoke-term

With the change, a node (which thinks it’s a leader) issues LeaseRevokeRequest including its term information. When other nodes apply the request, it compares the term in the request and its own term. If the term from request is older, it simply discard the request. I think it can be an adhoc mitigation for the probelm.

I tested the above branch with the fail point testing briefly (#15247 (comment)) and it seems to be working.

Pros:

  • Implementation is quite simple

Cons:

  • It will change WAL file format by the new field.
  • Using term information inside the state machine is unusual, need to be careful about it’s actually ok or not

I think similar thing can be done for lease renew by adding term information to LeaseKeepAliveRequest.


This approach is a very adhoc and dirty change so I’m not sure it’s worth trying. I think the watchdog approach will be good for being defensive for other issues caused by slow I/O (but this approach isn’t conflicting with the watchdog).

If you have comments, please let me know.

@serathius
Copy link
Member

Looking at the scope this has become a whole new feature for etcd to detect getting stuck on disk.

@ahrtr
Copy link
Member

ahrtr commented May 10, 2023

Looking at the scope this has become a whole new feature for etcd to detect getting stuck on disk.

It isn't a blocker for 3.4.26 and 3.5.9. The big challenge is to sort out how to prevent lease from being mistakenly revoked. I will have a deep dive into it once we have more confidence on bbolt.

@mitake 's comment above #15247 (comment) makes some sense, but there might have some potential compatibility issue. Will dig into it later as well.

@ahrtr
Copy link
Member

ahrtr commented Dec 19, 2023

Could anyone help to backport #16822 to 3.5 and 3.4?

@amit-rastogi
Copy link

Hi, I was looking for some good first issues to start with. I would like to pick this up if no one else is currently working on this. Please let me know.

@jmhbnz
Copy link
Member

jmhbnz commented Dec 29, 2023

/assign @amit-rastogi

Thanks Amit - Assigned to you to complete the backports, please let us know if you have any questions 🙏🏻

@yitian108
Copy link

yitian108 commented Jan 26, 2024

Not sure is it related to #17305 , @ahrtr ? But my k8s is standalone, not the cluster.
PS: the etcd version is 3.4.13

@ahrtr
Copy link
Member

ahrtr commented Jan 26, 2024

@yitian108

  • Both 3.4 and 3.5 have this issue as well, and we need to backport the fix to both of them.
  • What's your lease's TTL?
  • Could you double check whether you can reproduce the issue using main branch? We have already fixed this issue on main branch in Ignore old leader's leases revoking request #16822

@yitian108
Copy link

yitian108 commented Jan 29, 2024

Hi @ahrtr , sorry for the delay. Our lease's TTL is 15s. Sorry for the confusion, but the etcd server version is actually based on the image: FROM k8s.gcr.io/etcd:3.5.0. I will try to replace the image with gcr.io/etcd-development/etcd:v3.5.11 and test the case.

Unluckily, when I use gcr.io/etcd-development/etcd:v3.5.11 as the basic image, the command docker build reports
runc run failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory, Perhaps I should consider using another image that includes the fixed functionality from the main branch.

@ahrtr
Copy link
Member

ahrtr commented Jan 29, 2024

exec: "/bin/sh": stat /bin/sh: no such file or directory

Please read etcd-io/website#640

I will try to replace the image with gcr.io/etcd-development/etcd:v3.5.11 and test the case.

We haven't backport the fix to 3.5 and 3.4 yet. Could you try the main branch?

@yitian108
Copy link

Thank you @ahrtr for supporting the guidelines. For the main branch, can I follow the steps as below to build the images?

  1. Clone the main branch:
    git clone https://github.com/etcd-io/etcd.git

  2. Build Amd64 architecture:
    ARCH=amd64 GOOS=linux make build

  3. Build the image:
    cd bin
    cp "../Dockerfile-release.amd64" Dockerfile
    docker build -f Dockerfile -t my/etcd:3.5.11

Alternatively, after finishing the first step and building the source code, I will generate the binary program 'etcd'. Then, I will replace the old binary program 'etcd' in the existing image 'k8s.gcr.io/etcd:3.5.0'.

These are the two ways that could be used for the main branch. If I am wrong, please correct me.

In general, where can we obtain the official image after backporting the fix to 3.5 and 3.4?

@ahrtr
Copy link
Member

ahrtr commented Jan 29, 2024

  • Clone the main branch:
    git clone https://github.com/etcd-io/etcd.git
  • Build Amd64 architecture:
    ARCH=amd64 GOOS=linux make build
  • Build the image:
    cd bin
    cp "../Dockerfile-release.amd64" Dockerfile
    docker build -f Dockerfile -t my/etcd:3.5.11

Yes, it should work.

In general, where can we obtain the official image after backporting the fix to 3.5 and 3.4?

etcd uses gcr.io/etcd-development/etcd as a primary container registry, and quay.io/coreos/etcd as secondary.

Please refer to https://github.com/etcd-io/etcd/releases

@yitian108
Copy link

Hi @ahrtr , unfortunately, when executing the build command ARCH=amd64 GOOS=linux make build, it shows:
which: no yamllint in (/usr/local/go/bin:/usr/local/mysql/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/czz/FlameGraph:/usr/local/sbin:/usr/local/libexec) which: no yamlfmt in (/usr/local/go/bin:/usr/local/mysql/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/czz/FlameGraph:/usr/local/sbin:/usr/local/libexec) which: no golangci-lint in (/usr/local/go/bin:/usr/local/mysql/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/czz/FlameGraph:/usr/local/sbin:/usr/local/libexec) GO_BUILD_FLAGS=" -v -mod=readonly" ./scripts/build.sh Running etcd_build % rm -f bin/etcd % (cd server && env CGO_ENABLED=0 GO_BUILD_FLAGS= -v -mod=readonly GOOS=linux GOARCH=amd64 go build -v -mod=readonly -trimpath -installsuffix=cgo -ldflags=-X=go.etcd.io/etcd/api/v3/version.GitSHA=58846bd8f -o=../bin/etcd .) stderr: ../client/v3/example_auth_test.go:1:1: expected 'package', found '.' stderr: ../client/v3/concurrency/example_election_test.go:1:1: expected 'package', found '.' FAIL: (code:1): % (cd server && env CGO_ENABLED=0 GO_BUILD_FLAGS= -v -mod=readonly GOOS=linux GOARCH=amd64 go build -v -mod=readonly -trimpath -installsuffix=cgo -ldflags=-X=go.etcd.io/etcd/api/v3/version.GitSHA=58846bd8f -o=../bin/etcd .) FAIL: etcd_build (GOARCH=amd64) make: *** [build] Error 2

It seems that some packages' version mismatched, as the following picture shown, the Go file is missing package declaration.
ps: the go version is 1.18.

image

@yitian108
Copy link

Hi @ahrtr , Above errors may come from the old go version, since I update go version, it appears to work.

For testing the issue further, I reproduce it as below steps:

  1. clone the main branch
    git clone https://github.com/etcd-io/etcd.git
  2. ARCH=amd64 GOOS=linux make build
  3. cd bin && cp ../Dockerfile .
  4. docker build -t my/etcd:3.5.11 .

Go through the above 4 steps, the image my/etcd:3.5.11 will be created. Since remove /bin/sh #15173 from the image, I couldn't build a new image based on my/etcd:3.5.11 again with customed shell script file. So, I just copy the 3 files: etcd/etcdctl/etcdutl to my old images k8s.gcr.io/etcd:3.5.0, then I based on the copied file image to build a new image, suppose the new image name as k8s.gcr.io/etcd:3.5.11, then I start to test the case again.

For testing the new image, I prepared for some test tasks:

  1. edit the shell script for writing etcd keys
#!/bin/bash

for i in {1..100000}
do
etcdctl --endpoints=xx.xxx.xxx.xx:14592 put "t$i" "This value from the key $i zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
sleep 0.01
done

  1. edit the shell script for reading etcd keys
#!/bin/bash

for i in {1..10000}
do
etcdctl --endpoints=xx.xxx.xxx.xx:14592 get "t$i"
sleep 0.01
done
  1. prepare rand-write with command fio
    fio -output=/tmp/100R100W -name=100R100W -filename=/dev/sda -ioengine=sync -direct=1 -blocksize=2M -size=10GB -rw=randwrite -iodepth=8 -numjobs=1
  2. prepare rand-read and rand-write with command fio
    fio -filename=/dev/sda -direct=1 -iodepth 1 -thread -rw=randrw -rwmixread=70 -ioengine=psync -bs=1m -size=10G -numjobs=10 -runtime=60 -group_reporting -name=mytest -ioscheduler=noop
  3. prepare watch the etcd log

To start test

  1. Open 5 terminal windows according to the preparation tasks.
  2. Executing the command of the step 3 & step 4 almost at the same time
  3. Executing the command of the step 1 & step 2 almost at the same time
  4. Watch the output of the step 1 & step 2 and watch the logs of the etcd server
    [Set ETCD Key Log]
OK
OK
OK
OK
OK
{"level":"warn","ts":"2024-01-31T16:54:29.302+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-d334eb87-ef03-4df8-878a-e2cbe0fa26a5/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
OK
{"level":"warn","ts":"2024-01-31T16:54:36.650+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-6cdc8b1b-c76d-4f08-a080-1b19acb8766d/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
OK
{"level":"warn","ts":"2024-01-31T16:54:42.026+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-41a874b2-4f28-4b4b-bd26-0ec691d0f6a1/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:47.054+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-13d7317a-b213-41d8-be66-de6bcc5e03d3/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:52.115+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-302264df-fc43-437c-9736-8ceb464ad668/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:57.144+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-08d31c7b-2023-4dd5-91e8-7e4e641bc1d6/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

[Get ETCD Key Log]

t100
This value from the key 100 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
{"level":"warn","ts":"2024-01-31T16:54:29.292+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-040ed215-c525-4a4f-9273-43b3568ac8c8/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
t102
This value from the key 102 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
{"level":"warn","ts":"2024-01-31T16:54:36.649+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-0aa2d963-61de-4985-9094-136e392faa14/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:41.675+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2546fd54-7af0-4083-9b3f-8b07283699d4/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:46.700+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-de4ee4b3-720e-4b6e-8303-6f9f6176e627/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:51.725+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e8d38d3b-abad-45b6-9a4b-8f892a59704b/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:56.747+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-f0b9361c-fa16-46b2-87f2-e6e572201bbe/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

[ETCD Server Log]

{"level":"warn","ts":"2024-01-31T16:54:24.855254+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:25.356242+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:25.856359+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:26.357513+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:26.857696+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:27.358825+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:27.859296+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:28.359429+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:29.303137+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:24.304546+0800","time spent":"4.998551022s","remote":"xx.xxx.xxx.xx:61995","response type":"/etcdserverpb.KV/Put","request count":1,"request size":829,"response count":0,"response size":0,"request content":"key:\"t6\" value_size:822 "}
{"level":"warn","ts":"2024-01-31T16:54:31.018713+0800","caller":"wal/wal.go:816","msg":"slow fdatasync","took":"6.714036315s","expected-duration":"1s"}
{"level":"warn","ts":"2024-01-31T16:54:31.040403+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"6.746322608s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"t101\" ","response":"","error":"rangeKeys: context cancelled: context canceled"}
{"level":"info","ts":"2024-01-31T16:54:31.604168+0800","caller":"traceutil/trace.go:172","msg":"trace[1927508231] range","detail":"{range_begin:t101; range_end:; }","duration":"7.310088226s","start":"2024-01-31T16:54:24.294057+0800","end":"2024-01-31T16:54:31.604145+0800","steps":["trace[1927508231] 'range keys from in-memory index tree'  (duration: 6.746222684s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:31.604277+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:24.29402+0800","time spent":"7.310222566s","remote":"xx.xxx.xxx.xx:61993","response type":"/etcdserverpb.KV/Range","request count":0,"request size":6,"response count":0,"response size":0,"request content":"key:\"t101\" "}
{"level":"warn","ts":"2024-01-31T16:54:31.223615+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:31.604363+0800","caller":"etcdserver/v3_server.go:909","msg":"timed out waiting for read index response (local node might have slow network)","timeout":"7s"}
{"level":"warn","ts":"2024-01-31T16:54:31.604412+0800","caller":"etcdserver/v3_server.go:874","msg":"ignored out-of-date read index response; local node read indexes queueing up and waiting to be in sync with leader","sent-request-id":4822947687828599058,"received-request-id":4822947687828599055}
{"level":"warn","ts":"2024-01-31T16:54:31.355484+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:31.604447+0800","caller":"etcdserver/server.go:862","msg":"Ignore the lease revoking request because current member isn't a leader","local-member-id":8834979854806041326}
{"level":"info","ts":"2024-01-31T16:54:31.618665+0800","caller":"traceutil/trace.go:172","msg":"trace[1553294244] put","detail":"{key:t7; req_size:829; response_revision:9641; }","duration":"2.289589352s","start":"2024-01-31T16:54:29.329056+0800","end":"2024-01-31T16:54:31.618646+0800","steps":["trace[1553294244] 'process raft request'  (duration: 2.289500465s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:31.618762+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:29.329036+0800","time spent":"2.289699725s","remote":"xx.xxx.xxx.xx:35067","response type":"/etcdserverpb.KV/Put","request count":1,"request size":829,"response count":0,"response size":28,"request content":"key:\"t7\" value_size:822 "}
{"level":"warn","ts":"2024-01-31T16:54:31.61895+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"2.302181338s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"t102\" ","response":"range_response_count:1 size:851"}
{"level":"warn","ts":"2024-01-31T16:54:31.618989+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"3.992991745s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/ms/group/1\" limit:32533 ","response":"range_response_count:1 size:63"}
{"level":"info","ts":"2024-01-31T16:54:31.619038+0800","caller":"traceutil/trace.go:172","msg":"trace[2020392476] range","detail":"{range_begin:/ms/group/1; range_end:; response_count:1; response_revision:9641; }","duration":"3.993042259s","start":"2024-01-31T16:54:27.625985+0800","end":"2024-01-31T16:54:31.619027+0800","steps":["trace[2020392476] 'agreement among raft nodes before linearized reading'  (duration: 3.992941947s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:31.619089+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:27.625961+0800","time spent":"3.993113588s","remote":"xx.xxx.xxx.xx:51975","response type":"/etcdserverpb.KV/Range","request count":0,"request size":17,"response count":1,"response size":86,"request content":"key:\"/ms/group/1\" limit:32533 "}
{"level":"warn","ts":"2024-01-31T16:54:31.619208+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"3.993586072s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/ms/group/2\" limit:32575 ","response":"range_response_count:1 size:63"}
{"level":"info","ts":"2024-01-31T16:54:31.619001+0800","caller":"traceutil/trace.go:172","msg":"trace[1202781310] range","detail":"{range_begin:t102; range_end:; response_count:1; response_revision:9641; }","duration":"2.302243319s","start":"2024-01-31T16:54:29.316748+0800","end":"2024-01-31T16:54:31.618991+0800","steps":["trace[1202781310] 'agreement among raft nodes before linearized reading'  (duration: 2.30210135s)"],"step_count":1}
{"level":"info","ts":"2024-01-31T16:54:31.619243+0800","caller":"traceutil/trace.go:172","msg":"trace[2121909998] range","detail":"{range_begin:/ms/group/2; range_end:; response_count:1; response_revision:9641; }","duration":"3.993623372s","start":"2024-01-31T16:54:27.625612+0800","end":"2024-01-31T16:54:31.619235+0800","steps":["trace[2121909998] 'agreement among raft nodes before linearized reading'  (duration: 3.993537659s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:31.61928+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:29.316717+0800","time spent":"2.302540326s","remote":"xx.xxx.xxx.xx:35056","response type":"/etcdserverpb.KV/Range","request count":0,"request size":6,"response count":1,"response size":874,"request content":"key:\"t102\" "}
{"level":"warn","ts":"2024-01-31T16:54:31.619292+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:27.625568+0800","time spent":"3.993714699s","remote":"xx.xxx.xxx.xx:55357","response type":"/etcdserverpb.KV/Range","request count":0,"request size":17,"response count":1,"response size":86,"request content":"key:\"/ms/group/2\" limit:32575 "}
{"level":"info","ts":"2024-01-31T16:54:36.496593+0800","caller":"fileutil/purge.go:96","msg":"purged","path":"/data/etcd/member/snap/0000000000000002-0000000000004bc8.snap"}
{"level":"warn","ts":"2024-01-31T16:54:36.650245+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"4.99877509s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"t103\" ","response":"","error":"context canceled"}
{"level":"info","ts":"2024-01-31T16:54:36.650336+0800","caller":"traceutil/trace.go:172","msg":"trace[948427076] range","detail":"{range_begin:t103; range_end:; }","duration":"4.998869058s","start":"2024-01-31T16:54:31.651442+0800","end":"2024-01-31T16:54:36.650311+0800","steps":["trace[948427076] 'agreement among raft nodes before linearized reading'  (duration: 4.998770603s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:36.650394+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:31.651411+0800","time spent":"4.998969357s","remote":"xx.xxx.xxx.xx:46974","response type":"/etcdserverpb.KV/Range","request count":0,"request size":6,"response count":0,"response size":0,"request content":"key:\"t103\" "}
{"level":"warn","ts":"2024-01-31T16:54:36.650946+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:31.651846+0800","time spent":"4.999080243s","remote":"xx.xxx.xxx.xx:46976","response type":"/etcdserverpb.KV/Put","request count":1,"request size":829,"response count":0,"response size":0,"request content":"key:\"t8\" value_size:822 "}
{"level":"warn","ts":"2024-01-31T16:54:36.998188+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"5.377522133s","expected-duration":"100ms","prefix":"","request":"header:<ID:4822947687828599060 > lease_revoke:<id:42ee8d5e77509cdf>","response":"size:28"}
{"level":"info","ts":"2024-01-31T16:54:36.998396+0800","caller":"traceutil/trace.go:172","msg":"trace[900412583] linearizableReadLoop","detail":"{readStateIndex:19611; appliedIndex:19610; }","duration":"5.377666816s","start":"2024-01-31T16:54:31.620686+0800","end":"2024-01-31T16:54:36.998352+0800","steps":["trace[900412583] 'read index received'  (duration: 39.87µs)","trace[900412583] 'applied index is now lower than readState.Index'  (duration: 5.377625784s)"],"step_count":2}
{"level":"info","ts":"2024-01-31T16:54:36.998599+0800","caller":"traceutil/trace.go:172","msg":"trace[764521200] put","detail":"{key:t9; req_size:829; response_revision:9643; }","duration":"321.174018ms","start":"2024-01-31T16:54:36.677411+0800","end":"2024-01-31T16:54:36.998585+0800","steps":["trace[764521200] 'process raft request'  (duration: 321.119933ms)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:36.998652+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"5.37867209s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/ms/group/\" range_end:\"/ms/group0\" ","response":"range_response_count:2 size:119"}
{"level":"info","ts":"2024-01-31T16:54:36.998702+0800","caller":"traceutil/trace.go:172","msg":"trace[1755538231] range","detail":"{range_begin:/ms/group/; range_end:/ms/group0; response_count:2; response_revision:9643; }","duration":"5.378739001s","start":"2024-01-31T16:54:31.619954+0800","end":"2024-01-31T16:54:36.998693+0800","steps":["trace[1755538231] 'agreement among raft nodes before linearized reading'  (duration: 5.378594853s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:36.998745+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:31.619903+0800","time spent":"5.378829516s","remote":"xx.xxx.xxx.xx:51975","response type":"/etcdserverpb.KV/Range","request count":0,"request size":24,"response count":2,"response size":142,"request content":"key:\"/ms/group/\" range_end:\"/ms/group0\" "}
{"level":"warn","ts":"2024-01-31T16:54:36.998663+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:36.677403+0800","time spent":"321.241496ms","remote":"xx.xxx.xxx.xx:56786","response type":"/etcdserverpb.KV/Put","request count":1,"request size":829,"response count":0,"response size":28,"request content":"key:\"t9\" value_size:822 "}
{"level":"warn","ts":"2024-01-31T16:54:39.430229+0800","caller":"wal/wal.go:816","msg":"slow fdatasync","took":"2.400318935s","expected-duration":"1s"}
{"level":"warn","ts":"2024-01-31T16:54:41.675186+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"4.99863799s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"t104\" ","response":"","error":"context canceled"}

From the server log, we can see the slow fdatasync and waiting for ReadIndex response took too long, etc. Does it mean the issue still exists?

PS: I am not sure only copy 3 files etcd/etcdctl/etcdutl enough to test and the etcd pod's info as below:

# kubectl exec -it -n puc etcd-25-5556ccfc8f-8rc89 -- etcd --version
etcd Version: 3.6.0-alpha.0
Git SHA: 58846bd8f
Go Version: go1.21.6
Go OS/Arch: linux/amd64

@ahrtr
Copy link
Member

ahrtr commented Jan 31, 2024

From the server log, we can see the slow fdatasync and waiting for ReadIndex response took too long, etc. Does it mean the issue still exists?

No, it doesn't reproduce the issue. Please refer to

@yitian108
Copy link

That means what I raised is another issue?

@ahrtr
Copy link
Member

ahrtr commented Feb 1, 2024

From the server log, we can see the slow fdatasync and waiting for ReadIndex response took too long,

It isn't an issue. It just complained the performance of the disk and/or network communication between peers.

@yitian108
Copy link

yitian108 commented Feb 2, 2024

Hi @ahrtr , we can also see the logs:

{"level":"warn","ts":"2024-01-31T16:54:54.492078+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:47.491377+0800","time spent":"7.000695668s","remote":"100.103.215.183:43860","response type":"/etcdserverpb.Lease/LeaseRevoke","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:54.525051+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:47.524069+0800","time spent":"7.000975915s","remote":"100.103.215.201:40578","response type":"/etcdserverpb.Lease/LeaseRevoke","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:54.553334+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:47.552291+0800","time spent":"7.00103091s","remote":"100.103.215.163:52356","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:54.988081+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:47.986846+0800","time spent":"7.001222065s","remote":"xx.xxx.xxx.xx:45079","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:55.039296+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:48.038347+0800","time spent":"7.000941899s","remote":"100.103.215.202:58456","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
2024/01/31 16:54:58 WARNING: [core] [Server #3] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
{"level":"warn","ts":"2024-01-31T16:54:55.086478+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:48.085406+0800","time spent":"7.00106384s","remote":"100.103.215.205:56334","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:55.716188+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:48.71513+0800","time spent":"7.001051226s","remote":"xx.xxx.xxx.xx:53463","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:56.350418+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:51.792424+0800","time spent":"4.557982024s","remote":"100.103.215.202:44474","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
2024/01/31 16:54:58 WARNING: [core] [Server #3] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
{"level":"warn","ts":"2024-01-31T16:54:56.361035+0800","caller":"wal/wal.go:816","msg":"slow fdatasync","took":"6.81862586s","expected-duration":"1s"}
{"level":"warn","ts":"2024-01-31T16:54:56.366378+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"4.372258612s","expected-duration":"100ms","prefix":"","request":"header:<ID:4822947687828599072 > lease_revoke:<id:42ee8d5e774db0c7>","response":"size:28"}
{"level":"warn","ts":"2024-01-31T16:54:56.543459+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543452+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.54347+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543483+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543502+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543505+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543505+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543523+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543517+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543524+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543529+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}

During this period, all services that depend on the etcd service were restarted. I picked the log from one of these services:

{"level":"warn","ts":"2024-01-31T16:54:44.043+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-33a5a824-a274-4df9-ad14-64a2e526490f/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
{"level":"warn","ts":"2024-01-31T16:54:45.720+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-51a48217-3b7d-43be-a7e8-5ea185f5b18f/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
{"level":"warn","ts":"2024-01-31T16:54:51.070+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-33a5a824-a274-4df9-ad14-64a2e526490f/xx.xxx.xxx.xx:14592","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
{"level":"warn","ts":"2024-01-31T16:54:52.043+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-33a5a824-a274-4df9-ad14-64a2e526490f/xx.xxx.xxx.xx:14592","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2024-01-31 16:54:52.043647 I | role callback leader false
2024-01-31 16:54:52.043689 I | recv service active exit

I can understand it was related to disk performance, but due to etcd and all services deployed on the same server, there should not be a network issue. This should be a problem caused by high I/O, which led to etcd missing the lease and consequently causing all services to be restarted.

@ahrtr
Copy link
Member

ahrtr commented Feb 22, 2024

@ivanvc Please add a change for both 3.4 and 3.5, once it gets merged, then we can close this ticket. thx

Regarding #15247 (comment), I don't see an easy & cheap solution for now. I expect the Livez feature can workaround it.

@ahrtr
Copy link
Member

ahrtr commented Feb 23, 2024

All done. Thanks @ivanvc for backporting the fix to 3.5 and 3.4!

@ahrtr ahrtr closed this as completed Feb 23, 2024
ti-chi-bot bot pushed a commit to tikv/pd that referenced this issue Apr 7, 2024
ref #5520, close #8018

Upgrade etcd to v3.4.31 to fix the etcd issue: etcd-io/etcd#15247

Signed-off-by: dongmen <414110582@qq.com>
ti-chi-bot bot pushed a commit to tikv/pd that referenced this issue May 22, 2024
ref #5520, close #8018

Upgrade etcd to v3.4.31 to fix the etcd issue: etcd-io/etcd#15247

Signed-off-by: husharp <jinhao.hu@pingcap.com>

Co-authored-by: husharp <jinhao.hu@pingcap.com>
Co-authored-by: lhy1024 <liuhanyang@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. stage/tracked type/bug
Development

Successfully merging a pull request may close this issue.