Higher (6x) delay for 3.5.0 client to connect to cluster compared to 3.4 client #13240
I just upgraded my cluster to version 3.5 and it did not improve the delay, so this high delay problem does not seem limited to connecting to a cluster of a previous version. I will update the issue description accordingly.
Thanks for the issue, sounds like a large delay. How often have you run the above tests? An average might be more useful than a one-off. I will try to reproduce as well! Did you by any chance try with direct client access, or just with the CLI tool?
I first saw the problem with the golang 3.5.0 client. We had a default timeout of 500ms and since the upgrade were seeing context deadline exceeded errors that we did not see with the 3.3.x client. My first thought was that this was the way we were initializing the client, but since the etcdctl application is also impacted, it is most likely something more fundamental. Here are more statistically significant results (30 runs) with the 3.5.0 version of etcdctl:
And with the 3.4.9 version of etcdctl:
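Since direct client access was asked about above: here is a minimal sketch of how such a 30-run measurement could be done with the Go client rather than the CLI (the endpoint and key are placeholders, not our actual configuration):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	const runs = 30
	var total time.Duration
	for i := 0; i < runs; i++ {
		start := time.Now()
		// Create a fresh client on every run so the measurement includes
		// connection establishment, like a one-off etcdctl invocation does.
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
			DialTimeout: 5 * time.Second,
		})
		if err != nil {
			log.Fatal(err)
		}
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)		
		_, err = cli.Get(ctx, "some-key") // placeholder key
		cancel()
		cli.Close()
		if err != nil {
			log.Fatal(err)
		}
		total += time.Since(start)
	}
	fmt.Printf("average over %d runs: %v\n", runs, total/runs)
}
```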
A couple of questions:
$ etcd --version
I can't seem to reproduce this. I have tried with etcd running locally as well as etcd running in my infra, and both seem fast enough for me not to notice any latency. Note this is with etcd 3.5.0 both in the cluster and for etcdctl. I ran it 10 times or so and this is always the result:
Note mine are all compiled locally with go 1.16:
I am using official etcdctl clients downloaded through GitHub releases:
And the version of etcd running is packaged by Bitnami, so I'm not sure what build environment was used:
But I am getting the same behavior when talking to an official gcr.io/etcd-development/etcd:v3.4.15 container produced by etcd that is running in a 3-VM cluster. It has basic authentication enabled, which I know is not really supported by etcd; hopefully that is not introducing issues too.
Official version:
We can see it results in some delay.
I just used a standalone etcd 3.4.15 running in a VM without authentication (to rule that out as a factor) and measured the average time over 30 runs with the 3.4 client.
After looking at packet traces taken both on the server and on the client, it seems that some network proxy / firewall used within my organization, between our development environment and the cloud compute nodes where we are hosting etcd, is introducing abnormal delay. When making sure not to go through proxies / firewalls, the problem does not appear. So the problem is not directly in the client code, although it is strange that this appears only when using the 3.5 client. I will close it for now, and if we ever pinpoint a very specific incompatibility between the etcd / gRPC protocol and a particular appliance vendor, we can open a more specific issue.
Might be a consequence of #13192.
Problem description
When a 3.5.0 client connects to a 3.4.16 server, the delay is much longer: it increases from ~0.2 seconds with the 3.4 client to 1.2 seconds and more with the 3.5.0 client. Here is the delay we were seeing with the 3.4.9 client:
And here is the delay we are seeing with the 3.5.0 client:
In a Go application using the 3.5.0 client library, the default timeout of 2000ms is often no longer sufficient and we get context deadline exceeded errors from time to time. Raising the timeout to 3000ms seems to work, but there are significant delays when making requests.
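For illustration, here is a minimal sketch of the kind of client setup and per-request timeout involved (this is not our exact application code; the endpoint and key names are placeholders):

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"my-etcd:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The 2000ms per-request timeout that was sufficient with the 3.4 client.
	ctx, cancel := context.WithTimeout(context.Background(), 2000*time.Millisecond)
	_, err = cli.Get(ctx, "some-key") // placeholder key
	cancel()
	if err != nil {
		// With the 3.5.0 client this intermittently fails with
		// "context deadline exceeded"; raising the timeout to 3000ms avoids it.
		log.Fatal(err)
	}
}
```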
How to reproduce
Start up a 3.4.16 server on a Kubernetes cluster using
helm install my-etcd -f values.yaml bitnami/etcd
with the attached values.yaml. This is a simple 3-node etcd without authentication enabled, but it is only exposed through a single service. Use a 3.4.x and a 3.5.0 etcdctl to connect to your etcd cluster and measure the time required to get a response.
Please note that I also reproduce this with a more traditional 3.4.x server launched on 3 distinct reachable VMs with simple authentication enabled. In this case, getting the endpoint status can take over 3 seconds!
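For reference, connecting to such a cluster from the Go client would look roughly like this (the addresses and credentials are placeholders; Username/Password are the clientv3.Config fields for simple authentication):

```go
package main

import (
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Three distinct VM endpoints; the addresses are placeholders.
		Endpoints:   []string{"vm1:2379", "vm2:2379", "vm3:2379"},
		DialTimeout: 5 * time.Second,
		Username:    "root", // placeholder credentials for simple auth
		Password:    "secret",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}
```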