Description
What is the bug?
During a rollout of distributor pods I noticed some pods panic during shutdown.
1740358044474 {"caller":"signals.go:62","level":"info","msg":"=== received SIGINT/SIGTERM ===\n*** exiting","ts":"2025-02-24T00:47:24.455676056Z"}
1740358049474 {"caller":"module_service.go:120","level":"info","module":"active-groups-cleanup-service","msg":"module stopped","ts":"2025-02-24T00:47:29.463218373Z"}
1740358049474 {"caller":"basic_lifecycler.go:238","level":"info","msg":"ring lifecycler is shutting down","ring":"distributor","ts":"2025-02-24T00:47:29.463913053Z"}
1740358049474 {"caller":"basic_lifecycler.go:403","level":"info","msg":"unregistering instance from ring","ring":"distributor","ts":"2025-02-24T00:47:29.464202522Z"}
1740358049474 {"caller":"basic_lifecycler.go:278","level":"info","msg":"instance removed from the ring","ring":"distributor","ts":"2025-02-24T00:47:29.464347533Z"}
1740358049474 {"caller":"module_service.go:120","level":"info","module":"distributor-service","msg":"module stopped","ts":"2025-02-24T00:47:29.464872722Z"}
1740358049474 {"caller":"module_service.go:120","level":"info","module":"ingester-ring","msg":"module stopped","ts":"2025-02-24T00:47:29.464974382Z"}
1740358049474 {"caller":"module_service.go:120","level":"info","module":"runtime-config","msg":"module stopped","ts":"2025-02-24T00:47:29.465042882Z"}
1740358049474 {"caller":"memberlist_client.go:720","level":"info","msg":"leaving memberlist cluster","ts":"2025-02-24T00:47:29.465085602Z"}
1740358049489 2025/02/24 00:47:29 http: panic serving 10.252.42.5:47292: send on closed channel
1740358049489 goroutine 399553 [running]:
1740358049489 net/http.(*conn).serve.func1()
1740358049489 /usr/local/go/src/net/http/server.go:1903 +0xbe
1740358049489 panic({0x2706d20?, 0x3539f10?})
1740358049489 /usr/local/go/src/runtime/panic.go:770 +0x132
1740358049489 github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5.1()
1740358049489 /__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:155 +0x175
1740358049489 panic({0x2706d20?, 0x3539f10?})
1740358049489 /usr/local/go/src/runtime/panic.go:770 +0x132
1740358049489 github.com/grafana/dskit/concurrency.(*ReusableGoroutinesPool).Go(0x29c75c0?, 0xc01d3c18c0)
1740358049489 /__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/worker.go:28 +0x25
1740358049489 github.com/grafana/dskit/ring.DoBatchWithOptions({0x356b828, 0xc01d3c1860}, 0x1, {0x35602d0, 0xc000c9a908}, {0xc01d3ca000, 0x7d0, 0x2872ec0?}, 0xc01d3c1890, {0xc01d3c2b70, ...})
1740358049489 /__w/mimir/mimir/vendor/github.com/grafana/dskit/ring/batch.go:180 +0x722
1740358049489 github.com/grafana/mimir/pkg/distributor.(*Distributor).sendWriteRequestToIngesters(0xc001c9c808, {0x356b828, 0xc01d3c1860}, {0x35602d0, 0xc000c9a908}, 0xc01ceead40, {0xc01d3ca000, 0x7d0, 0x7d0}, 0x7d0, ...)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1579 +0x136
1740358049489 github.com/grafana/mimir/pkg/distributor.(*Distributor).sendWriteRequestToBackends(0xc001c9c808, {0x356b828, 0xc01d3c1860}, {0xc0253d9745, 0xb}, 0xc01ceead40, {0xc01d3ca000, 0x7d0, 0x7d0}, 0x7d0, ...)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1536 +0x90a
1740358049489 github.com/grafana/mimir/pkg/distributor.(*Distributor).push(0xc001c9c808, {0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1485 +0x65b
1740358049489 github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0x19535758676?)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489 github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushValidationMiddleware-fm.(*Distributor).prePushValidationMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1131 +0xd78
1740358049489 github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc0253d9745?)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489 github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushSortAndFilterMiddleware-fm.(*Distributor).prePushSortAndFilterMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:988 +0x21d
1740358049489 github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0x107283d3b67288c4?)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489 github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushRelabelMiddleware-fm.(*Distributor).prePushRelabelMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:943 +0x4d6
1740358049489 github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xb?)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489 github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushHaDedupeMiddleware-fm.(*Distributor).prePushHaDedupeMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:886 +0x752
1740358049489 github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc001dc8f10?)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489 github.com/grafana/mimir/pkg/distributor.(*Distributor).metricsMiddleware-fm.(*Distributor).metricsMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1177 +0x3e7
1740358049489 github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc01ceedd40?)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489 github.com/grafana/mimir/pkg/distributor.(*Distributor).limitsMiddleware-fm.(*Distributor).limitsMiddleware.func1({0x356b828?, 0xc01ceedd40?}, 0xc01c0f1b90)
1740358049489 /__w/mimir/mimir/pkg/distributor/distributor.go:1360 +0x237
1740358049489 github.com/grafana/mimir/pkg/api.(*API).RegisterDistributor.Handler.handler.func2({0x3566d60, 0xc01ceead00}, 0xc01ced9440)
1740358049489 /__w/mimir/mimir/pkg/distributor/push.go:159 +0x25a
1740358049489 net/http.HandlerFunc.ServeHTTP(0x0?, {0x3566d60?, 0xc01ceead00?}, 0x412005?)
1740358049489 /usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490 github.com/grafana/mimir/pkg/api.(*API).newRoute.ConsistencyMiddleware.func1.1({0x3566d60, 0xc01ceead00}, 0xc01ced9440)
1740358049490 /__w/mimir/mimir/pkg/querier/api/consistency.go:58 +0xab
1740358049490 net/http.HandlerFunc.ServeHTTP(0xc01ced9320?, {0x3566d60?, 0xc01ceead00?}, 0xc001dc92a8?)
1740358049490 /usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490 github.com/grafana/mimir/pkg/api.New.newTenantValidationMiddleware.func1.1({0x3566d60, 0xc01ceead00}, 0xc01ced9320)
1740358049490 /__w/mimir/mimir/pkg/api/tenant.go:43 +0x174
1740358049490 net/http.HandlerFunc.ServeHTTP(0xc01ced9200?, {0x3566d60?, 0xc01ceead00?}, 0x3530a01?)
1740358049490 /usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490 github.com/grafana/dskit/middleware.init.func2.1({0x3566d60, 0xc01ceead00}, 0xc01ced9200)
1740358049490 /__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/http_auth.go:21 +0x108
1740358049490 net/http.HandlerFunc.ServeHTTP(0xc01ced90e0?, {0x3566d60?, 0xc01ceead00?}, 0xc001dc9420?)
1740358049490 /usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490 github.com/gorilla/mux.(*Router).ServeHTTP(0xc000000480, {0x3566d60, 0xc01ceead00}, 0xc01ced8fc0)
1740358049490 /__w/mimir/mimir/vendor/github.com/gorilla/mux/mux.go:212 +0x1e2
1740358049490 github.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1.2({0x3566d60?, 0xc01ceead00?})
1740358049490 /__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:89 +0x33
1740358049490 github.com/felixge/httpsnoop.(*Metrics).CaptureMetrics(0xc0273e0eb8, {0x7c386eb2ee70, 0xc051b57080}, 0xc001dc9750)
1740358049490 /__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:84 +0x1e5
1740358049490 github.com/felixge/httpsnoop.CaptureMetricsFn({0x7c386eb2ee70, 0xc051b57080}, 0xc001dc9750)
1740358049490 /__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:39 +0x4e
1740358049490 github.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1({0x7c386eb2ee70, 0xc051b57080}, 0xc01ced8fc0)
1740358049490 /__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:88 +0x2dd
1740358049490 net/http.HandlerFunc.ServeHTTP(0x3562a50?, {0x7c386eb2ee70?, 0xc051b57080?}, 0xc01ceedc50?)
1740358049490 /usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490 github.com/grafana/dskit/middleware.(*Log).Wrap.Log.Wrap.func1({0x3562a50, 0xc051b57020}, 0xc01ced8fc0)
1740358049490 /__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/logging.go:90 +0x26f
1740358049490 net/http.HandlerFunc.ServeHTTP(0x412005?, {0x3562a50?, 0xc051b57020?}, 0xc000ca3901?)
1740358049490 /usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490 github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5({0x355f9a0, 0xc024c26c40}, 0xc01ced8c60)
1740358049490 /__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:159 +0x4d6
1740358049490 net/http.HandlerFunc.ServeHTTP(0xc01ced8b40?, {0x355f9a0?, 0xc024c26c40?}, 0x7c386eaec6e0?)
1740358049490 /usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490 github.com/grafana/dskit/middleware.(*RouteInjector).Wrap.RouteInjector.Wrap.func1({0x355f9a0, 0xc024c26c40}, 0xc01ced8b40)
1740358049490 /__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/route_injector.go:24 +0x72
1740358049490 net/http.HandlerFunc.ServeHTTP(0x412005?, {0x355f9a0?, 0xc024c26c40?}, 0xc024c26c01?)
1740358049490 /usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490 net/http.serverHandler.ServeHTTP({0x3557af8?}, {0x355f9a0?, 0xc024c26c40?}, 0x6?)
1740358049490 /usr/local/go/src/net/http/server.go:3142 +0x8e
1740358049490 net/http.(*conn).serve(0xc01c0f19e0, {0x356b828, 0xc001f20120})
1740358049490 /usr/local/go/src/net/http/server.go:2044 +0x5e8
1740358049490 created by net/http.(*Server).Serve in goroutine 301
1740358049490 /usr/local/go/src/net/http/server.go:3290 +0x4b4
1740358049504 2025/02/24 00:47:29 http: panic serving 10.252.95.26:32878: send on closed channel
1740358049504 goroutine 399493 [running]:
1740358049504 net/http.(*conn).serve.func1()
1740358049504 /usr/local/go/src/net/http/server.go:1903 +0xbe
......
......
# This continues for other goroutines
It does not happen on all pods, and seems rather random.
I was able to reproduce this by terminating a single pod, and weirdly this also caused CPU to drop on all other distributors.

I did note some 503s from Prometheus at the time, which is likely why this happened, but it is interesting that a single pod caused this behaviour.
The most notable change is probably to the gRPC config, made to try to align with the GKE graceful termination period of 15 seconds for spot instances:
- -server.grpc.keepalive.max-connection-age=10s
- -server.grpc.keepalive.max-connection-age-grace=5s
- -server.grpc.keepalive.max-connection-idle=10s
- -shutdown-delay=5s
This could be the cause, in which case running on spot instances would not be viable.
The CPU drop across all pods is confusing, however, as I would expect Prometheus to just retry if a single distributor pod terminated the connection unexpectedly.
How to reproduce it?
Unsure.
Possibly with the gRPC changes:
- -server.grpc.keepalive.max-connection-age=10s
- -server.grpc.keepalive.max-connection-age-grace=5s
- -server.grpc.keepalive.max-connection-idle=10s
- -shutdown-delay=5s
What did you think would happen?
Graceful shutdown of the distributor pod.
What was your environment?
Mimir 2.13
GKE 1.31.5-gke.1068000
Any additional context to share?
I could not see anything related in the release notes through to 2.15, but am happy to upgrade.