
goroutine leak during blocking queries #15010

Closed
mechpen opened this issue Oct 17, 2022 · 2 comments · Fixed by #15068
mechpen commented Oct 17, 2022

Overview of the Issue

We observed frequent spikes in the number of goroutines, from 40k to ~1M. We traced the root cause to a goroutine leak in go-memdb.

Reproduction Steps

Steps to reproduce this issue:

  1. Start a consul server.
  2. Create a test service with 10000 instances, and with some tag, e.g. service name: large-service, tag: test-tag.
  3. Run curl "localhost:8500/v1/catalog/service/large-service?tag=test-tag&index=99999999999"
  4. Keep updating the service large-service.
  5. Observe consul goroutine count.

The curl command should not return because the index is very large (expected).
The goroutine count spikes (not expected).
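For step 5, one way to observe the goroutine count is to poll the agent's telemetry endpoint. A minimal sketch, assuming a default local agent and that the consul.runtime.num_goroutines gauge is reported on /v1/agent/metrics (the JSON field names below are an assumption based on the go-metrics summary format):

```go
// pollgoroutines.go: print Consul's goroutine gauge every few seconds.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// metricsSummary models just the fields we need from /v1/agent/metrics.
type metricsSummary struct {
	Gauges []struct {
		Name  string
		Value float64
	}
}

func main() {
	for {
		resp, err := http.Get("http://localhost:8500/v1/agent/metrics")
		if err == nil {
			var m metricsSummary
			if json.NewDecoder(resp.Body).Decode(&m) == nil {
				for _, g := range m.Gauges {
					if g.Name == "consul.runtime.num_goroutines" {
						fmt.Printf("%s goroutines=%.0f\n",
							time.Now().Format(time.RFC3339), g.Value)
					}
				}
			}
			resp.Body.Close()
		}
		time.Sleep(5 * time.Second)
	}
}
```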

Cause analysis

The function blockingQuery() has a for loop that calls the go-memdb WatchSet.WatchCtx() to watch service-related state. When any watched state changes, WatchCtx() returns. If the service did not actually change, or the MinQueryIndex is large, the loop starts a new iteration, so WatchCtx() can be called many times within one blockingQuery() call.
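To make the loop shape concrete, here is a simplified sketch (not Consul's actual blockingQuery code; runQuery stands in for the state-store query that fills the watch set):

```go
package sketch

import (
	"context"

	memdb "github.com/hashicorp/go-memdb"
)

// blockingQuerySketch mimics the loop described above: each iteration
// builds a fresh WatchSet, runs the query, and blocks in WatchCtx until
// a watched channel fires or the request context ends. With a huge
// MinQueryIndex, WatchCtx can be entered many times per request.
func blockingQuerySketch(ctx context.Context, minQueryIndex uint64,
	runQuery func(ws memdb.WatchSet) (uint64, error)) error {

	for {
		ws := memdb.NewWatchSet()
		index, err := runQuery(ws) // populates ws with watch channels
		if err != nil {
			return err
		}
		if index > minQueryIndex {
			return nil // state moved past the client's index; reply now
		}
		// Blocks until something watched changes or ctx is done.
		if err := ws.WatchCtx(ctx); err != nil {
			return err // context cancelled or deadline exceeded
		}
		// A spurious wakeup (nothing relevant changed) loops again.
	}
}
```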

The go-memdb WatchSet.watchMany() function has a bug that leaks goroutines, which causes Consul goroutine spikes during blocking queries. The leaked goroutines are cleaned up when the blocking query returns, either due to a service update or a timeout.
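The leak follows a common fan-out pattern, roughly like the sketch below (illustrative only, not go-memdb's actual watchMany code): the watcher goroutines are only reclaimed when the parent context ends, so each extra trip through the loop above leaves more of them behind.

```go
package sketch

import "context"

// watchFanOutSketch shows the leaky pattern: spawn one goroutine per
// watched channel and return as soon as any of them fires. The other
// goroutines keep running until ctx is cancelled, i.e. until the whole
// blocking query returns, so repeated calls from the loop above
// accumulate goroutines.
func watchFanOutSketch(ctx context.Context, chans []<-chan struct{}) {
	fired := make(chan struct{}, len(chans))
	for _, ch := range chans {
		go func(ch <-chan struct{}) {
			select {
			case <-ch:
			case <-ctx.Done():
			}
			fired <- struct{}{}
		}(ch)
	}
	// Return on the first trigger; without a per-call cancellation
	// signal, the remaining goroutines are leaked until ctx ends.
	select {
	case <-fired:
	case <-ctx.Done():
	}
}
```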

Consul info for both Client and Server

Found in consul 1.11.7-ent.


mechpen commented Oct 17, 2022

The early returns of WatchSet.WatchCtx() inside blockingQuery() appear to be caused by a watch-limit optimization.

Function agent/consul/state/catalog.go:parseCheckServiceNodes() calls the following:

  • ws.AddWithLimit(watchLimit, watchCh, allNodesCh)
  • ws.AddWithLimit(watchLimit, iter.WatchCh(), allChecksCh)

When watchLimit (8192) is exceeded, the watch set starts watching allNodesCh and allChecksCh instead of the per-row channels, so changes to unrelated nodes or checks can cause WatchCtx() to return early.
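For reference, a rough sketch of the AddWithLimit pattern (the fallback behaviour is go-memdb's; the surrounding code is illustrative, not the actual parseCheckServiceNodes):

```go
package sketch

import memdb "github.com/hashicorp/go-memdb"

const watchLimit = 8192 // Consul's soft limit referenced above

// collectWatchesSketch adds one channel per row until watchLimit is
// exceeded; after that, AddWithLimit adds the fallback channel
// (allNodesCh) instead. Any node change then fires the fallback and
// wakes WatchCtx, even if the queried service did not change.
func collectWatchesSketch(ws memdb.WatchSet,
	rowChans []<-chan struct{}, allNodesCh <-chan struct{}) {

	for _, ch := range rowChans {
		ws.AddWithLimit(watchLimit, ch, allNodesCh)
	}
}
```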


clobrox commented Oct 19, 2022

Just want to provide an update: we have reproduced the issue without using an artificially large index value.

  1. We create a 10K-instance large-service with the tag test-tag, and a second, single-instance service dummy-service.
  2. We issue a blocking query: curl "localhost:8500/v1/catalog/service/large-service?tag=test-tag".
  3. We update dummy-service (not large-service) and see the goroutine count increase significantly without the blocking query returning.

If we repeat step 3, we can drive the goroutine count very high. Once the blocking query returns, either on a change to large-service or on a timeout, the goroutine count drops back to normal.
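As an illustration of the repeated updates in step 3, a minimal sketch using the official API client (github.com/hashicorp/consul/api); the node and service names are just the ones from this reproduction:

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

// Repeatedly re-register the single-instance dummy-service so the
// catalog index keeps advancing while large-service stays unchanged.
func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; ; i++ {
		reg := &api.CatalogRegistration{
			Node:    "dummy-node",
			Address: "127.0.0.1",
			Service: &api.AgentService{
				ID:      "dummy-service-1",
				Service: "dummy-service",
				Port:    8000 + i%10, // vary something so the index bumps
			},
		}
		if _, err := client.Catalog().Register(reg, nil); err != nil {
			log.Println("register failed:", err)
		}
		time.Sleep(time.Second)
	}
}
```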

This matches what we see in our large production cluster. We have a job that issues a /v1/health/service blocking query with a tag against a 4193-instance service with a 10m timeout. By the time the full 10m elapses, we see the spikes up to ~1M goroutines that @mechpen mentions.

We enabled streaming for the nodes running the instances of the job issuing these queries and the problem went away.

Even though streaming works around our current issue, we'd still like to see this fixed: we're worried about new queries being introduced to our clusters from nodes without streaming enabled, or via calls that do not support streaming. If many such calls were introduced in short order, they could overwhelm our Consul cluster.
