Higher (6x) delay for 3.5.0 client to connect to cluster compared to 3.4 client #13240

Closed · acastong opened this issue Jul 22, 2021 · 9 comments

@acastong commented Jul 22, 2021

Problem description

When a 3.5.0 client connects to a 3.4.16 server, the delay is much longer: it increases from ~0.2 s with the 3.4 client to 1.2 s and more with the 3.5.0 client. Here is the delay we were seeing with the 3.4.9 client:

$ ./etcdctl version
etcdctl version: 3.4.9
API version: 3.4
$ time ./etcdctl --endpoints=172.29.208.23:30859 get foo
foo
bar

real    0m0.179s
user    0m0.007s
sys     0m0.013s

And here is the delay we are seeing with the 3.5.0 client:

$ ./etcdctl_3.5 version
etcdctl version: 3.5.0
API version: 3.5

$ time ./etcdctl_3.5 --endpoints=172.29.208.23:30859 get foo
foo
bar

real    0m1.190s
user    0m0.027s
sys     0m0.050s

In a Go application using the 3.5.0 client library, the default timeout of 2000 ms is often no longer sufficient and we get context deadline exceeded errors from time to time. Raising the timeout to 3000 ms seems to work, but there are still significant delays when making requests.
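
For reference, here is a minimal sketch of the clientv3 pattern that surfaces the error (illustrative only: the endpoint, key, and timeout values are placeholders, not our actual application code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is illustrative; any reachable cluster member works.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"172.29.208.23:30859"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// With the 3.5.0 client, a 2000 ms per-request deadline is sometimes
	// exceeded; raising it to 3000 ms works but requests remain slow.
	ctx, cancel := context.WithTimeout(context.Background(), 2000*time.Millisecond)
	resp, err := cli.Get(ctx, "foo")
	cancel()
	if err != nil {
		fmt.Println("Get failed:", err) // e.g. "context deadline exceeded"
		return
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s: %s\n", kv.Key, kv.Value)
	}
}
```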

How to reproduce

Start up a 3.4.16 server on a Kubernetes cluster using helm install my-etcd -f values.yaml bitnami/etcd with the attached [values.yaml.txt](https://github.com/etcd-io/etcd/files/6864934/values.yaml.txt). This is a simple 3-node etcd cluster without authentication enabled, exposed only through a single service.

Use a 3.4.x and a 3.5.0 etcdctl to connect to your etcd cluster and measure the time required to get a response.

Note that I can also reproduce this with a more traditional 3.4.x server running on 3 distinct reachable VMs with simple authentication enabled. In that case, getting the endpoint status can take over 3 seconds!

$ time ./etcdctl --endpoints=172.29.86.134:2379 endpoint status --user=myuser --password=mypass
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, false, false, 1038, 3446827, 3446827,

real    0m0.296s
user    0m0.012s
sys     0m0.025s

$ time ./etcdctl_3.5 --endpoints=172.29.86.134:2379 endpoint status --user=myuser --password=mypass
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, false, false, 1038, 3446769, 3446769,

real    0m3.175s
user    0m0.054s
sys     0m0.112s
@acastong (Author)

I just upgraded my cluster to version 3.5 and it did not improve the delay, so this high delay problem does not seem limited to connecting to a cluster of a previous version. I will update the issue description accordingly.

$ time ./etcdctl_3.5 --endpoints=172.29.208.23:30859 endpoint status
172.29.208.23:30859, 9cd5e4dd11d83bef, 3.5.0, 20 kB, false, false, 393, 226, 226,

real    0m1.480s
user    0m0.024s
sys     0m0.058s

$ time ./etcdctl_3.5 --endpoints=172.29.208.23:30859 get foo
foo
bar

real    0m1.483s
user    0m0.025s
sys     0m0.047s

acastong changed the title from "High delay for 3.5.0 client to connect to a 3.4 server" to "Higher (6x) delay for 3.5.0 client to connect to cluster compared to 3.4 client" on Jul 26, 2021
@lilic (Contributor) commented Jul 26, 2021

Thanks for the issue, that sounds like a large delay. How often have you run the above tests? An average might be more useful than a one-off. I will try to reproduce it as well!

Did you by any chance try with direct client access, or just with the CLI tool?

@acastong (Author)

I first saw the problem with the Go 3.5.0 client. We had a default timeout of 500 ms and, since upgrading, were seeing context deadline exceeded errors that we did not see with the 3.3.x client. My first thought was that this was how we were initializing the client, but since the etcdctl application is also affected, it is most likely something more fundamental.
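
To illustrate the direct-access case, here is a rough sketch of how one could time the Go client directly (the endpoint and key are placeholders, not our actual code). The first request typically also pays the connection-establishment cost, so a second, warm request helps separate connection setup from per-request latency:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// timedGet runs a single Get with a 3 s deadline and prints its duration.
func timedGet(cli *clientv3.Client, label string) {
	start := time.Now()
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	_, err := cli.Get(ctx, "foo")
	cancel()
	fmt.Printf("%s: %v (err: %v)\n", label, time.Since(start), err)
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"172.29.208.23:30859"},
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	timedGet(cli, "cold get") // includes connection establishment
	timedGet(cli, "warm get") // connection already up
}
```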

Here are more statistically significant results, over 30 runs with perf stat, for the 3.5.0 version of etcdctl:

 Performance counter stats for './etcdctl_3.5 --endpoints=172.29.208.23:30859 endpoint status' (30 runs):

             79.96 msec task-clock:u              #    0.065 CPUs utilized            ( +-  5.75% )

            # Table of individual measurements:
            1.8371 (+0.6154) #######
            2.0122 (+0.7905) ########
            1.1139 (-0.1077) ##
            1.0626 (-0.1591) ###
            1.0593 (-0.1624) ####
            1.0623 (-0.1593) ###
            1.3111 (+0.0894) ##
            1.0582 (-0.1635) ####
            1.0614 (-0.1603) ####
            1.0592 (-0.1625) ####
            2.1483 (+0.9266) #########
            1.2184 (-0.0033) #
            1.0620 (-0.1597) ####
            1.7222 (+0.5005) ######
            1.0217 (-0.2000) ####
            1.0107 (-0.2110) #####
            1.0226 (-0.1990) ####
            0.5129 (-0.7088) ############################
            1.0567 (-0.1649) ####
            1.0624 (-0.1593) ###
            1.0634 (-0.1582) ###
            1.0760 (-0.1457) ###
            1.0540 (-0.1676) ####
            1.0585 (-0.1632) ####
            2.0678 (+0.8462) #########
            1.6154 (+0.3937) #####
            1.0567 (-0.1649) ####
            1.3050 (+0.0834) ##
            1.0165 (-0.2052) #####
            0.8616 (-0.3601) #########

            # Final result:
            1.2217 +- 0.0692 seconds time elapsed  ( +-  5.67% )

And with the 3.4.9 version of etcdctl:

 Performance counter stats for './etcdctl --endpoints=172.29.208.23:30859 endpoint status' (30 runs):

             37.97 msec task-clock:u              #    0.157 CPUs utilized            ( +-  8.13% )

            # Table of individual measurements:
            0.3970 (+0.1550) ########
            0.2068 (-0.0352) ####
            0.1859 (-0.0562) #######
            0.1757 (-0.0664) ########
            0.1838 (-0.0583) #######
            0.1817 (-0.0604) #######
            0.1880 (-0.0541) ######
            0.1842 (-0.0579) #######
            0.1969 (-0.0451) #####
            0.1885 (-0.0535) ######
            0.2252 (-0.0169) ##
            0.2113 (-0.0307) ###
            0.2329 (-0.0092) #
            0.2279 (-0.0141) ##
            0.2279 (-0.0141) ##
            0.2051 (-0.0370) ####
            0.1843 (-0.0578) #######
            0.1895 (-0.0525) ######
            0.2314 (-0.0107) #
            0.2213 (-0.0208) ##
            0.1974 (-0.0447) #####
            0.2234 (-0.0187) ##
            0.1883 (-0.0537) ######
            0.1940 (-0.0481) #####
            0.1837 (-0.0584) #######
            0.1918 (-0.0503) ######
            0.2127 (-0.0293) ###
            0.1811 (-0.0609) #######
            1.2155 (+0.9735) #################
            0.2288 (-0.0133) ##

            # Final result:
            0.2421 +- 0.0344 seconds time elapsed  ( +- 14.19% )

@hexfusion (Contributor) commented Jul 26, 2021

A couple of questions:

  • Were these binaries downloaded or compiled locally?
  • Can you show the Go version?

$ etcd --version

@lilic (Contributor) commented Jul 26, 2021

I can't seem to reproduce this. I have tried with etcd running locally as well as in my infra, and both are fast enough that I don't notice any latency. Note this is with etcd 3.5.0 in the cluster and etcdctl 3.5.0. I ran it 10 times or so and this is always the result:

./bin/etcdctl endpoint status --endpoints=x:2379  0.03s user 0.02s system 27% cpu 0.171 total
./bin/etcdctl get --prefix 2021 --endpoints=x:2379  0.04s user 0.04s system 26% cpu 0.275 total

Note mine are all compiled locally with Go 1.16:

./bin/etcd --version
etcd Version: 3.5.0
Git SHA: 946a5a6f2
Go Version: go1.16
Go OS/Arch: darwin/amd64

@acastong (Author) commented Jul 26, 2021

I am using the official etcdctl clients downloaded from GitHub releases:

$ ./etcdctl version
etcdctl version: 3.4.9
API version: 3.4
$ ./etcdctl_3.5 version
etcdctl version: 3.5.0
API version: 3.5

And the etcd server is packaged by Bitnami, so I'm not sure what build environment was used:

$ kubectl exec acastonguay-etcd-0 --container etcd -- etcd --version
etcd Version: 3.5.0
Git SHA: 946a5a6f2
Go Version: go1.16.3
Go OS/Arch: linux/amd64

But I am getting the same behavior when talking to an official gcr.io/etcd-development/etcd:v3.4.15 container running in a 3-VM cluster. It has basic authentication turned on, which I know is not really supported by etcd; hopefully that is not introducing issues too.

$ perf stat -r30 --table -e task-clock ./etcdctl_3.5 --endpoints=172.29.86.134:2379 endpoint status --user=prom --password=prom
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555098, 3555098,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555102, 3555102,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555104, 3555104,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555108, 3555108,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555110, 3555110,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555114, 3555114,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555118, 3555118,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555120, 3555120,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555124, 3555124,
{"level":"warn","ts":"2021-07-26T10:12:23.323-0400","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003b4700/#initially=[172.29.86.134:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint 172.29.86.134:2379 (failed to getToken from endpoint 172.29.86.134:2379 with maintenance client: context deadline exceeded)
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555129, 3555129,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555131, 3555131,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555135, 3555135,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555137, 3555137,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555141, 3555141,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555143, 3555143,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555147, 3555147,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555149, 3555149,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555153, 3555153,
{"level":"warn","ts":"2021-07-26T10:12:54.211-0400","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000428700/#initially=[172.29.86.134:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2021-07-26T10:12:56.250-0400","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00043ea80/#initially=[172.29.86.134:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555157, 3555157,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555161, 3555161,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555163, 3555163,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555167, 3555167,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555169, 3555169,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555173, 3555173,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555175, 3555175,
172.29.86.134:2379, d82f6fae06668473, 3.4.15, 7.7 MB, true, false, 1156, 3555179, 3555179,
{"level":"warn","ts":"2021-07-26T10:13:23.251-0400","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00045e380/#initially=[172.29.86.134:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

 Performance counter stats for './etcdctl_3.5 --endpoints=172.29.86.134:2379 endpoint status --user=prom --password=prom' (30 runs):

             96.83 msec task-clock:u              #    0.031 CPUs utilized            ( +-  7.28% )

            # Table of individual measurements:
            2.6771 (-0.4135) ####
            3.2108 (+0.1202) #
            3.2181 (+0.1275) #
            3.6316 (+0.5410) ###
            3.9953 (+0.9048) #####
            3.2416 (+0.1510) #
            3.2089 (+0.1183) #
            3.2041 (+0.1135) #
            3.2239 (+0.1333) #
            3.2242 (+0.1336) #
            3.1227 (+0.0322) #
            3.2112 (+0.1206) #
            3.5603 (+0.4697) ###
            2.8618 (-0.2288) ##
            3.2072 (+0.1166) #
            3.2107 (+0.1201) #
            3.2380 (+0.1474) #
            3.2059 (+0.1153) #
            3.2030 (+0.1124) #
            2.0347 (-1.0559) ###########
            2.0382 (-1.0524) ###########
            2.2659 (-0.8247) ########
            3.2073 (+0.1168) #
            3.5023 (+0.4117) ###
            2.9095 (-0.1811) ##
            3.2135 (+0.1230) #
            3.2571 (+0.1665) ##
            3.3296 (+0.2390) ##
            3.2219 (+0.1313) #
            2.0812 (-1.0093) ##########

            # Final result:
            3.0906 +- 0.0832 seconds time elapsed  ( +-  2.69% )

Official version:

$ sudo podman exec etcd-pod-etcd etcd --version
etcd Version: 3.4.15
Git SHA: aa7126864
Go Version: go1.12.17
Go OS/Arch: linux/amd64

We can see this results in some DeadlineExceeded errors.

@acastong (Author)

I just ran etcd 3.4.15 standalone in a VM without authentication (to rule that out as a factor): the average time over 30 runs for get foo is 0.039660 s with the 3.4.9 client and 1.0229 s with the 3.5.0 client. So the problem is also present without auth.

@acastong (Author) commented Aug 31, 2021

After looking at packet traces taken both on the server and on the client, it seems that a network proxy / firewall used within my organization, between our development environment and the cloud compute nodes hosting etcd, is introducing the abnormal delay.

When we make sure not to go through the proxies / firewalls, the problem does not appear.

So the problem is not directly in the client code, although it is strange that it appears only with the 3.5 client. I will close this for now; if we ever pinpoint a specific incompatibility between the etcd / gRPC protocol and a particular appliance vendor, we can open a more targeted issue.

@ptabor (Contributor) commented Aug 31, 2021

Might be a consequence of #13192.
