Potential goroutine leak in serverWatchStream #18704
Based on the diagram you provided, the goroutines were reclaimed periodically. That seems like expected behaviour from the Go runtime's perspective.
Sorry, I should have stated that explicitly: the "reclamation" happens on etcd cluster node restart.
Just to confirm, are you saying the goroutine count for each etcd member doesn't decrease until it's restarted? etcdserver creates a serverWatchStream for each Watch RPC stream (etcd/server/etcdserver/api/v3rpc/watch.go, line 193 in bc5512a).
Do the clients keep creating streams and never delete them? What's the client-side behaviour?
That's correct.
The stream is created implicitly when new context metadata is passed. I didn't find a way to explicitly close a stream on the client side later, but I briefly looked at the client code, and it seems like it should close the stream once it has no active watchers (though there is a decent chance I misunderstood the implementation).
You can close the stream by either closing the client or closing the watch stream (Line 387 in 68e33d4).
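For example (a rough sketch only; the endpoint and key are placeholders):

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close() // closing the client tears down every watch stream it owns

	// A dedicated Watcher can also be closed without closing the whole client.
	w := clientv3.NewWatcher(cli)
	_ = w.Watch(context.Background(), "/some/prefix", clientv3.WithPrefix())
	w.Close() // closes the gRPC watch stream(s) owned by this Watcher
}
```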
The implementation closes the old stream after new ones are created with new context metadata.
So far I can't repro with a single client in an isolated environment, and there are no goroutines stuck sending data to ctrlStream.
Looks like in other places the code does not block on sending data to sws.ctrlStream if sws.closec is closed: etcd/server/etcdserver/api/v3rpc/watch.go, lines 298 to 303 in 40b4715, and lines 339 to 343 in 40b4715.
But the codepath I'm seeing in traces does block: etcd/server/etcdserver/api/v3rpc/watch.go, lines 349 to 360 in 40b4715.
I wonder if it could be an issue. Maybe this codepath should look like the sketch below.
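Roughly (a sketch only, not a tested patch; it mirrors the select-on-sws.closec pattern the other call sites use, with `wr` standing in for the response this path builds):

```go
// Fragment of recvLoop (sketch): guard the ctrlStream send so the goroutine
// can exit once the watch stream is closing, instead of blocking forever.
select {
case sws.ctrlStream <- wr: // wr: the *pb.WatchResponse this codepath currently sends unconditionally
case <-sws.closec:
	// the watch stream has closed; stop instead of blocking on ctrlStream
	return nil
}
```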
Have you verified it can resolve your performance "issue"? I do not see any real case which may lead to etcdserver being blocked here, but it won't do any harm to add a protection as you pointed out.
I already deployed the change to the staging env; today/tomorrow I'll be able to validate it in prod.
At first glance it seems to have fixed the issue; there are no goroutines with the following traces:
```
# 0xd9bb8c go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*serverWatchStream).recvLoop+0x70c external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:348
# 0xd9af65 go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*watchServer).Watch.func2+0x45
```
Did you see lots of goroutines blocked at etcd/server/etcdserver/api/v3rpc/watch.go, line 350 in 54db7f0?
Were the counts at etcd/server/etcdserver/api/v3rpc/watch.go, line 378 in 54db7f0, comparable?
In theory, it seems that there is a race condition between sendLoop exiting and recvLoop still sending on ctrlStream.
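To see that failure mode in isolation, here is a toy sketch (plain Go, not etcd code; the channel names just stand in for sws.ctrlStream and sws.closec):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ctrl := make(chan int, 1)     // stands in for sws.ctrlStream (small buffer)
	closec := make(chan struct{}) // stands in for sws.closec

	// "sendLoop": drains one message, then exits and signals shutdown.
	go func() {
		<-ctrl
		close(closec)
	}()

	// "recvLoop": sends without selecting on closec, so once the buffer fills
	// after the drainer is gone, this goroutine blocks forever.
	go func() {
		for i := 0; ; i++ {
			ctrl <- i // a select with `case <-closec: return` here would let it exit
		}
	}()

	<-closec
	time.Sleep(100 * time.Millisecond)
	fmt.Println("the producer goroutine is now blocked on ctrl and will never exit")
}
```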
In the trace I posted initially there are 40k+ such goroutines, and the number is growing over time. The traces point to https://github.com/etcd-io/etcd/blob/v3.5.14/server/etcdserver/api/v3rpc/watch.go#L348, and the number of such goroutines does not match the number of sendLoop goroutines running.
Do you have rough numbers (count of recvLoop : count of sendLoop)?
The first column in the goroutine profile depicts the number of goroutines, in this case 49411.
So 10385 sendLoop goroutines, 10387 healthy recvLoop goroutines, and 49411 recvLoop goroutines stuck sending data to the channel.
I missed it, thanks. To double confirm: you do not see this difference (counts of recvLoop vs sendLoop) after patching the change in #18704 (comment)?
At a high level the change makes sense; all other usages of sws.ctrlStream already guard the send against sws.closec being closed.
Thanks for the feedback. The reason should be the race condition pointed out in #18704 (comment). The problem is that it's a little hard to reproduce and verify in an e2e test. One solution that I can think of is to add two metrics.
The workaround for this performance issue is to restart the etcd instance.
Reopening to track the backporting effort.
All done.
Can we ensure backports are discoverable by linking them back to this issue?
Bug report criteria
What happened?
We run several moderately loaded clusters with the following characteristics:
We had a few performance issues at this scale, so the client has logic to:
It's implemented via context metadata:
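Roughly like the following sketch (simplified; `cli` is a *clientv3.Client, `metadata` comes from google.golang.org/grpc/metadata, and the "watch-group" key and groupID value are placeholders). Attaching distinct outgoing metadata to the watch context makes clientv3 open a separate gRPC watch stream for those watchers:

```go
// Sketch: per-group metadata on the watch context; clientv3 keys its internal
// gRPC watch streams by the outgoing context metadata, so each group ends up
// on its own stream to the server.
ctx := metadata.AppendToOutgoingContext(context.Background(), "watch-group", groupID)
wch := cli.Watch(ctx, "/some/prefix", clientv3.WithPrefix())
```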
We observe what seems to be a slow goroutine leak on our clusters.
This one in particular seems to be the culprit:
I suspect that it's due to the stream not being properly closed after all watchers have migrated to a new HTTP/2 stream.
What did you expect to happen?
Steady number of goroutines over time.
How can we reproduce it (as minimally and precisely as possible)?
I'd assume creating and stopping a set of watchers, each with its own context metadata, would trigger the issue, along the lines of the sketch below.
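A rough repro sketch (the endpoint, key prefix, and "watch-group" metadata key are placeholders, and the timing is arbitrary):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"google.golang.org/grpc/metadata"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for i := 0; ; i++ {
		// Fresh metadata per iteration forces a new gRPC watch stream on the server.
		ctx, cancel := context.WithCancel(
			metadata.AppendToOutgoingContext(context.Background(), "watch-group", fmt.Sprint(i)))
		wch := cli.Watch(ctx, "/repro/prefix", clientv3.WithPrefix())
		go func() {
			for range wch { // drain until the watch channel closes
			}
		}()
		time.Sleep(100 * time.Millisecond)
		cancel() // stop the watchers; the member's goroutine count should stay flat
	}
}
```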
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response