
goroutine leak during blocking queries #15010

Closed
mechpen opened this issue Oct 17, 2022 · 2 comments · Fixed by #15068
mechpen commented Oct 17, 2022

Overview of the Issue

We observed frequent spikes in the number of goroutines, from 40k to ~1M. We traced the root cause to a goroutine leak in go-memdb.

Reproduction Steps

Steps to reproduce this issue:

  1. Start a consul server.
  2. Create a test service with 10000 instances, and with some tag, e.g. service name: large-service, tag: test-tag.
  3. Run curl "localhost:8500/v1/catalog/service/large-service?tag=test-tag&index=99999999999"
  4. Keep updating the service large-service.
  5. Observe consul goroutine count.

The curl command should not return because the index is very large (expected).
The goroutine count spikes (not expected).
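For step 5, one way to observe the goroutine count is to poll the agent's telemetry endpoint. A minimal sketch, assuming a default local agent and that the consul.runtime.num_goroutines gauge is reported on /v1/agent/metrics (the JSON field names below are an assumption based on the go-metrics summary format):

```go
// pollgoroutines.go: print Consul's goroutine gauge every few seconds.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// metricsSummary models just the fields we need from /v1/agent/metrics.
type metricsSummary struct {
	Gauges []struct {
		Name  string
		Value float64
	}
}

func main() {
	for {
		resp, err := http.Get("http://localhost:8500/v1/agent/metrics")
		if err == nil {
			var m metricsSummary
			if json.NewDecoder(resp.Body).Decode(&m) == nil {
				for _, g := range m.Gauges {
					if g.Name == "consul.runtime.num_goroutines" {
						fmt.Printf("%s goroutines=%.0f\n",
							time.Now().Format(time.RFC3339), g.Value)
					}
				}
			}
			resp.Body.Close()
		}
		time.Sleep(5 * time.Second)
	}
}
```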

Cause analysis

The function blockingQuery() has a for loop that calls the go-memdb WatchSet.WatchCtx() to watch service-related state. When any watched state changes, WatchCtx() returns. If the service did not actually change, or the MinQueryIndex is large, the loop starts a new iteration, so WatchCtx() can be called many times within one blockingQuery() call.
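To make the loop shape concrete, here is a simplified sketch (not Consul's actual blockingQuery code; runQuery stands in for the state-store query that fills the watch set):

```go
package sketch

import (
	"context"

	memdb "github.com/hashicorp/go-memdb"
)

// blockingQuerySketch mimics the loop described above: each iteration
// builds a fresh WatchSet, runs the query, and blocks in WatchCtx until
// a watched channel fires or the request context ends. With a huge
// MinQueryIndex, WatchCtx can be entered many times per request.
func blockingQuerySketch(ctx context.Context, minQueryIndex uint64,
	runQuery func(ws memdb.WatchSet) (uint64, error)) error {

	for {
		ws := memdb.NewWatchSet()
		index, err := runQuery(ws) // populates ws with watch channels
		if err != nil {
			return err
		}
		if index > minQueryIndex {
			return nil // state moved past the client's index; reply now
		}
		// Blocks until something watched changes or ctx is done.
		if err := ws.WatchCtx(ctx); err != nil {
			return err // context cancelled or deadline exceeded
		}
		// A spurious wakeup (nothing relevant changed) loops again.
	}
}
```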

The go-memdb WatchSet.watchMany() function has a bug that leaks goroutines, which causes Consul goroutine spikes during blocking queries. The leaked goroutines are cleaned up when the blocking query returns, either due to a service update or a timeout.
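The leak follows a common fan-out pattern, roughly like the sketch below (illustrative only, not go-memdb's actual watchMany code): the watcher goroutines are only reclaimed when the parent context ends, so each extra trip through the loop above leaves more of them behind.

```go
package sketch

import "context"

// watchFanOutSketch shows the leaky pattern: spawn one goroutine per
// watched channel and return as soon as any of them fires. The other
// goroutines keep running until ctx is cancelled, i.e. until the whole
// blocking query returns, so repeated calls from the loop above
// accumulate goroutines.
func watchFanOutSketch(ctx context.Context, chans []<-chan struct{}) {
	fired := make(chan struct{}, len(chans))
	for _, ch := range chans {
		go func(ch <-chan struct{}) {
			select {
			case <-ch:
			case <-ctx.Done():
			}
			fired <- struct{}{}
		}(ch)
	}
	// Return on the first trigger; without a per-call cancellation
	// signal, the remaining goroutines are leaked until ctx ends.
	select {
	case <-fired:
	case <-ctx.Done():
	}
}
```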

Consul info for both Client and Server

Found in consul 1.11.7-ent.


mechpen commented Oct 17, 2022

The early returns of WatchSet.WatchCtx() inside blockingQuery() appear to be caused by a watch-limit optimization.

Function agent/consul/state/catalog.go:parseCheckServiceNodes() calls the following:

  • ws.AddWithLimit(watchLimit, watchCh, allNodesCh)
  • ws.AddWithLimit(watchLimit, iter.WatchCh(), allChecksCh)

When watchLimit (8192) is exceeded, the watch set starts watching allNodesCh and allChecksCh instead of the per-row channels, so changes to unrelated nodes or checks can cause WatchCtx() to return early.
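For reference, a rough sketch of the AddWithLimit pattern (the fallback behaviour is go-memdb's; the surrounding code is illustrative, not the actual parseCheckServiceNodes):

```go
package sketch

import memdb "github.com/hashicorp/go-memdb"

const watchLimit = 8192 // Consul's soft limit referenced above

// collectWatchesSketch adds one channel per row until watchLimit is
// exceeded; after that, AddWithLimit adds the fallback channel
// (allNodesCh) instead. Any node change then fires the fallback and
// wakes WatchCtx, even if the queried service did not change.
func collectWatchesSketch(ws memdb.WatchSet,
	rowChans []<-chan struct{}, allNodesCh <-chan struct{}) {

	for _, ch := range rowChans {
		ws.AddWithLimit(watchLimit, ch, allNodesCh)
	}
}
```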


clobrox commented Oct 19, 2022

Just want to provide an update: we have reproduced the issue without using an artificially large index value.

  1. We create a 10K-instance large-service with the tag test-tag, and a second, single-instance service dummy-service.
  2. We issue a blocking query: curl "localhost:8500/v1/catalog/service/large-service?tag=test-tag".
  3. We update dummy-service (not large-service) and see the goroutine count increase significantly without the blocking query returning.

If we repeat step 3, we can drive the goroutine count very high. Once the blocking query returns, either on a change to large-service or on a timeout, the goroutine count drops back to normal.
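As an illustration of the repeated updates in step 3, a minimal sketch using the official API client (github.com/hashicorp/consul/api); the node and service names are just the ones from this reproduction:

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

// Repeatedly re-register the single-instance dummy-service so the
// catalog index keeps advancing while large-service stays unchanged.
func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; ; i++ {
		reg := &api.CatalogRegistration{
			Node:    "dummy-node",
			Address: "127.0.0.1",
			Service: &api.AgentService{
				ID:      "dummy-service-1",
				Service: "dummy-service",
				Port:    8000 + i%10, // vary something so the index bumps
			},
		}
		if _, err := client.Catalog().Register(reg, nil); err != nil {
			log.Println("register failed:", err)
		}
		time.Sleep(time.Second)
	}
}
```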

This matches what we see in our large production cluster. We have a job that issues a /v1/health/service blocking query with a tag against a 4193-instance service with a 10m timeout. By the time the full 10m elapses, we see the spikes up to ~1M goroutines that @mechpen mentions.

We enabled streaming for the nodes running the instances of the job issuing these queries and the problem went away.

Even though streaming works around our current issue, we'd still like to see this fixed: we're worried about new queries being introduced to our clusters from nodes without streaming enabled, or via calls that do not support streaming. If many such calls were introduced in short order, they could overwhelm our Consul cluster.
