Description
Set up a small cluster of 4 nodes: 3 cluster members plus 1 proxy (the proxy's config has ETCD_PROXY="on", which is not the default).
On the proxy node, check for open TCP sockets like this:
netstat -n -p -a | fgrep etcd
When local clients on the proxy node connect to 127.0.0.1:2379, a new connection appears from an ephemeral TCP port on the proxy node to port 2379 on one of the three cluster members. That is fine; it is the proxy behaviour in operation. However, these proxy connections are not cleaned up properly when etcd runs for a long time. The netstat check above shows more and more lines of output as more activity goes through the proxy. Over time the available file handles are consumed, and eventually etcd refuses connections:
etcd: http: Accept error: accept tcp [::]:2379: accept4: too many open files; retrying in 5ms
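To track the growth over time instead of eyeballing it, the netstat output can be summarised per TCP state. This is only a minimal sketch, assuming Python 3 and the usual Linux column layout of netstat -n -p -a; the function name count_etcd_sockets is hypothetical, and it only counts sockets owned by etcd whose remote end is port 2379 (the outgoing proxy connections).

```python
from collections import Counter


def count_etcd_sockets(netstat_output: str, port: int = 2379) -> Counter:
    """Count etcd-owned TCP sockets whose remote end is `port`, grouped by state.

    Assumed column layout (Linux `netstat -n -p -a`):
    Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP rows and UNIX-socket rows.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        foreign, state, owner = fields[4], fields[5], fields[6]
        # Keep only outgoing connections from etcd to a member's client port.
        if owner.endswith("/etcd") and foreign.rsplit(":", 1)[-1] == str(port):
            states[state] += 1
    return states
```

Feeding it the output of the netstat command above every few minutes (as root, so the PID/Program column is populated) should show whether the ESTABLISHED or CLOSE_WAIT counts only ever go up.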
This appears related to an old issue, #1959, which has possibly come back or was never fully fixed in the first place.
Restarting etcd temporarily gets it working again, but the file handles are gradually consumed again over time, requiring more restarts. It is not necessary to restart the entire cluster; restarting the proxy node alone is sufficient, so the proxy is certainly the culprit.
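Before each restart it may be worth recording how close the proxy is to its limit. The following is a rough sketch, assuming Python 3 and the Linux /proc layout; both function names are hypothetical, and the etcd PID has to be looked up separately (for example from systemd).

```python
import os
import re


def parse_nofile_soft_limit(limits_text: str) -> int:
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text."""
    match = re.search(r"^Max open files\s+(\d+)\s+(\d+)", limits_text, re.MULTILINE)
    if match is None:
        raise ValueError("no numeric 'Max open files' line found")
    return int(match.group(1))


def fd_usage(pid: int) -> tuple:
    """Return (open_fds, soft_limit) for a process.

    Reading /proc/<pid>/fd needs root or the same user as the process.
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as f:
        soft_limit = parse_nofile_soft_limit(f.read())
    return open_fds, soft_limit
```

If the first number creeps steadily towards the second while client traffic stays flat, that matches the leak described here.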
Platform is CentOS 7, using the packaged etcd installed by "yum" and launched via "systemd". Package details:
Name : etcd
Arch : x86_64
Version : 3.2.7
Release : 1.el7
Size : 39 M
Repo : installed
From repo : extras
Summary : A highly-available key value store for shared configuration
URL : https://github.com/coreos/etcd
License : ASL 2.0
Description : A highly-available key value store for shared configuration.
In the process maps I see that the etcd process uses the following libraries:
/usr/lib64/libc-2.17.so
/usr/lib64/libdl-2.17.so
/usr/lib64/libpthread-2.17.so
/usr/lib64/ld-2.17.so
These are all very standard CentOS system libraries. I doubt the bug is in a library, but at least you should be able to reproduce the same setup fairly easily. Many commenters on the older issues reported that a similar setup (3 nodes + 1 proxy) reproduced this problem, so it appears to happen quite consistently. We are running a bunch of web servers and a load balancer that share session data via etcd, which should amount to fairly simple read/write operations that can easily be simulated for testing. I'm guessing the content of the data is irrelevant; a typical data block is approximately 1 KB.
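For what it's worth, a load like ours could be simulated roughly as follows. This is a sketch under assumptions, not our actual client code: it assumes Python 3 and that clients use the v2 HTTP keys API through the local proxy, and the sessions/ key layout and function names are made up for illustration. Python's urllib opens a fresh connection per request, which is close to many short-lived web-server clients.

```python
import os
import urllib.parse
import urllib.request

PROXY_URL = "http://127.0.0.1:2379"  # local proxy endpoint from the report


def build_put(key: str, value: str, base: str = PROXY_URL) -> urllib.request.Request:
    """Build a v2 keys API write: PUT /v2/keys/<key> with form field `value`."""
    body = urllib.parse.urlencode({"value": value}).encode("ascii")
    return urllib.request.Request(f"{base}/v2/keys/{key}", data=body, method="PUT")


def build_get(key: str, base: str = PROXY_URL) -> urllib.request.Request:
    """Build a v2 keys API read: GET /v2/keys/<key>."""
    return urllib.request.Request(f"{base}/v2/keys/{key}", method="GET")


def run(iterations: int = 1000) -> None:
    """Write and read back session-like values of about 1 KB via the proxy."""
    for i in range(iterations):
        key = f"sessions/test-{i % 50}"  # hypothetical key layout
        value = os.urandom(512).hex()  # 1024 hex characters, roughly 1 KB
        for req in (build_put(key, value), build_get(key)):
            with urllib.request.urlopen(req, timeout=5) as resp:
                resp.read()  # drain the body so the connection can close
```

Calling run() on the proxy node while watching the netstat output should show whether the outgoing connections accumulate.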