Panic in Distributor in Mimir 2.13 #10724

Closed
@lasermoth

What is the bug?

During a rollout of distributor pods, I noticed that some pods panicked during shutdown.

1740358044474	{"caller":"signals.go:62","level":"info","msg":"=== received SIGINT/SIGTERM ===\n*** exiting","ts":"2025-02-24T00:47:24.455676056Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"active-groups-cleanup-service","msg":"module stopped","ts":"2025-02-24T00:47:29.463218373Z"}
1740358049474	{"caller":"basic_lifecycler.go:238","level":"info","msg":"ring lifecycler is shutting down","ring":"distributor","ts":"2025-02-24T00:47:29.463913053Z"}
1740358049474	{"caller":"basic_lifecycler.go:403","level":"info","msg":"unregistering instance from ring","ring":"distributor","ts":"2025-02-24T00:47:29.464202522Z"}
1740358049474	{"caller":"basic_lifecycler.go:278","level":"info","msg":"instance removed from the ring","ring":"distributor","ts":"2025-02-24T00:47:29.464347533Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"distributor-service","msg":"module stopped","ts":"2025-02-24T00:47:29.464872722Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"ingester-ring","msg":"module stopped","ts":"2025-02-24T00:47:29.464974382Z"}
1740358049474	{"caller":"module_service.go:120","level":"info","module":"runtime-config","msg":"module stopped","ts":"2025-02-24T00:47:29.465042882Z"}
1740358049474	{"caller":"memberlist_client.go:720","level":"info","msg":"leaving memberlist cluster","ts":"2025-02-24T00:47:29.465085602Z"}
1740358049489	2025/02/24 00:47:29 http: panic serving 10.252.42.5:47292: send on closed channel
1740358049489	goroutine 399553 [running]:
1740358049489	net/http.(*conn).serve.func1()
1740358049489		/usr/local/go/src/net/http/server.go:1903 +0xbe
1740358049489	panic({0x2706d20?, 0x3539f10?})
1740358049489		/usr/local/go/src/runtime/panic.go:770 +0x132
1740358049489	github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5.1()
1740358049489		/__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:155 +0x175
1740358049489	panic({0x2706d20?, 0x3539f10?})
1740358049489		/usr/local/go/src/runtime/panic.go:770 +0x132
1740358049489	github.com/grafana/dskit/concurrency.(*ReusableGoroutinesPool).Go(0x29c75c0?, 0xc01d3c18c0)
1740358049489		/__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/worker.go:28 +0x25
1740358049489	github.com/grafana/dskit/ring.DoBatchWithOptions({0x356b828, 0xc01d3c1860}, 0x1, {0x35602d0, 0xc000c9a908}, {0xc01d3ca000, 0x7d0, 0x2872ec0?}, 0xc01d3c1890, {0xc01d3c2b70, ...})
1740358049489		/__w/mimir/mimir/vendor/github.com/grafana/dskit/ring/batch.go:180 +0x722
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).sendWriteRequestToIngesters(0xc001c9c808, {0x356b828, 0xc01d3c1860}, {0x35602d0, 0xc000c9a908}, 0xc01ceead40, {0xc01d3ca000, 0x7d0, 0x7d0}, 0x7d0, ...)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1579 +0x136
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).sendWriteRequestToBackends(0xc001c9c808, {0x356b828, 0xc01d3c1860}, {0xc0253d9745, 0xb}, 0xc01ceead40, {0xc01d3ca000, 0x7d0, 0x7d0}, 0x7d0, ...)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1536 +0x90a
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).push(0xc001c9c808, {0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1485 +0x65b
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0x19535758676?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushValidationMiddleware-fm.(*Distributor).prePushValidationMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1131 +0xd78
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc0253d9745?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushSortAndFilterMiddleware-fm.(*Distributor).prePushSortAndFilterMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:988 +0x21d
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0x107283d3b67288c4?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushRelabelMiddleware-fm.(*Distributor).prePushRelabelMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:943 +0x4d6
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xb?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).prePushHaDedupeMiddleware-fm.(*Distributor).prePushHaDedupeMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:886 +0x752
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc001dc8f10?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).metricsMiddleware-fm.(*Distributor).metricsMiddleware.func1({0x356b828, 0xc01ceedd70}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1177 +0x3e7
1740358049489	github.com/grafana/mimir/pkg/distributor.NextOrCleanup.func1({0x356b828?, 0xc01ceedd70?}, 0xc01ceedd40?)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1372 +0x2b
1740358049489	github.com/grafana/mimir/pkg/distributor.(*Distributor).limitsMiddleware-fm.(*Distributor).limitsMiddleware.func1({0x356b828?, 0xc01ceedd40?}, 0xc01c0f1b90)
1740358049489		/__w/mimir/mimir/pkg/distributor/distributor.go:1360 +0x237
1740358049489	github.com/grafana/mimir/pkg/api.(*API).RegisterDistributor.Handler.handler.func2({0x3566d60, 0xc01ceead00}, 0xc01ced9440)
1740358049489		/__w/mimir/mimir/pkg/distributor/push.go:159 +0x25a
1740358049489	net/http.HandlerFunc.ServeHTTP(0x0?, {0x3566d60?, 0xc01ceead00?}, 0x412005?)
1740358049489		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/mimir/pkg/api.(*API).newRoute.ConsistencyMiddleware.func1.1({0x3566d60, 0xc01ceead00}, 0xc01ced9440)
1740358049490		/__w/mimir/mimir/pkg/querier/api/consistency.go:58 +0xab
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced9320?, {0x3566d60?, 0xc01ceead00?}, 0xc001dc92a8?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/mimir/pkg/api.New.newTenantValidationMiddleware.func1.1({0x3566d60, 0xc01ceead00}, 0xc01ced9320)
1740358049490		/__w/mimir/mimir/pkg/api/tenant.go:43 +0x174
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced9200?, {0x3566d60?, 0xc01ceead00?}, 0x3530a01?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/dskit/middleware.init.func2.1({0x3566d60, 0xc01ceead00}, 0xc01ced9200)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/http_auth.go:21 +0x108
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced90e0?, {0x3566d60?, 0xc01ceead00?}, 0xc001dc9420?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/gorilla/mux.(*Router).ServeHTTP(0xc000000480, {0x3566d60, 0xc01ceead00}, 0xc01ced8fc0)
1740358049490		/__w/mimir/mimir/vendor/github.com/gorilla/mux/mux.go:212 +0x1e2
1740358049490	github.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1.2({0x3566d60?, 0xc01ceead00?})
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:89 +0x33
1740358049490	github.com/felixge/httpsnoop.(*Metrics).CaptureMetrics(0xc0273e0eb8, {0x7c386eb2ee70, 0xc051b57080}, 0xc001dc9750)
1740358049490		/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:84 +0x1e5
1740358049490	github.com/felixge/httpsnoop.CaptureMetricsFn({0x7c386eb2ee70, 0xc051b57080}, 0xc001dc9750)
1740358049490		/__w/mimir/mimir/vendor/github.com/felixge/httpsnoop/capture_metrics.go:39 +0x4e
1740358049490	github.com/grafana/dskit/middleware.(*Instrument).Wrap.Instrument.Wrap.func1({0x7c386eb2ee70, 0xc051b57080}, 0xc01ced8fc0)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/instrument.go:88 +0x2dd
1740358049490	net/http.HandlerFunc.ServeHTTP(0x3562a50?, {0x7c386eb2ee70?, 0xc051b57080?}, 0xc01ceedc50?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/dskit/middleware.(*Log).Wrap.Log.Wrap.func1({0x3562a50, 0xc051b57020}, 0xc01ced8fc0)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/logging.go:90 +0x26f
1740358049490	net/http.HandlerFunc.ServeHTTP(0x412005?, {0x3562a50?, 0xc051b57020?}, 0xc000ca3901?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5({0x355f9a0, 0xc024c26c40}, 0xc01ced8c60)
1740358049490		/__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:159 +0x4d6
1740358049490	net/http.HandlerFunc.ServeHTTP(0xc01ced8b40?, {0x355f9a0?, 0xc024c26c40?}, 0x7c386eaec6e0?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	github.com/grafana/dskit/middleware.(*RouteInjector).Wrap.RouteInjector.Wrap.func1({0x355f9a0, 0xc024c26c40}, 0xc01ced8b40)
1740358049490		/__w/mimir/mimir/vendor/github.com/grafana/dskit/middleware/route_injector.go:24 +0x72
1740358049490	net/http.HandlerFunc.ServeHTTP(0x412005?, {0x355f9a0?, 0xc024c26c40?}, 0xc024c26c01?)
1740358049490		/usr/local/go/src/net/http/server.go:2171 +0x29
1740358049490	net/http.serverHandler.ServeHTTP({0x3557af8?}, {0x355f9a0?, 0xc024c26c40?}, 0x6?)
1740358049490		/usr/local/go/src/net/http/server.go:3142 +0x8e
1740358049490	net/http.(*conn).serve(0xc01c0f19e0, {0x356b828, 0xc001f20120})
1740358049490		/usr/local/go/src/net/http/server.go:2044 +0x5e8
1740358049490	created by net/http.(*Server).Serve in goroutine 301
1740358049490		/usr/local/go/src/net/http/server.go:3290 +0x4b4
1740358049504	2025/02/24 00:47:29 http: panic serving 10.252.95.26:32878: send on closed channel
1740358049504	goroutine 399493 [running]:
1740358049504	net/http.(*conn).serve.func1()
1740358049504		/usr/local/go/src/net/http/server.go:1903 +0xbe
......
......
# This continues for other goroutines

It does not happen on all pods, and seems rather random.
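For readers unfamiliar with this failure mode: the `send on closed channel` panic in the trace is what the Go runtime raises when a goroutine sends on a channel that another goroutine has already closed. The sketch below reproduces that with a hypothetical worker pool. The `pool` type and its methods are illustrative assumptions, not the actual dskit `ReusableGoroutinesPool` code; they only mirror the general shape suggested by the trace (a `Go` method that hands work to workers over a channel).

```go
package main

import "fmt"

// pool is a hypothetical worker pool used only for illustration.
// It is NOT the dskit implementation.
type pool struct {
	jobs chan func()
}

// newPool starts the given number of workers, each draining p.jobs
// until the channel is closed.
func newPool(workers int) *pool {
	p := &pool{jobs: make(chan func())}
	for i := 0; i < workers; i++ {
		go func() {
			for f := range p.jobs {
				f()
			}
		}()
	}
	return p
}

// Go hands f to an idle worker, or runs it on a fresh goroutine if no
// worker is ready. A send on a closed channel panics even inside a
// select with a default case.
func (p *pool) Go(f func()) {
	select {
	case p.jobs <- f:
	default:
		go f()
	}
}

// Close stops the workers by closing the jobs channel.
func (p *pool) Close() { close(p.jobs) }

// submitAfterClose closes the pool and then submits work, recovering
// the resulting panic so that its message can be inspected.
func submitAfterClose() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	p := newPool(2)
	p.Close()
	p.Go(func() {})
	return "no panic"
}

func main() {
	fmt.Println(submitAfterClose()) // prints "send on closed channel"
}
```

If something similar is happening here, a push request that is still in flight after the pool has been closed during shutdown would panic in the same way. That would be consistent with the panic only showing up on some pods, depending on timing.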

I was able to reproduce this by terminating a single pod, and weirdly this also caused CPU to drop on all other distributors.

I did note some 503s from Prometheus at the time, which is likely why this happened, but it is interesting that a single pod caused this behaviour.

Probably the most important thing to note is a change to the gRPC config, made to try to align with the GKE graceful termination period of 15 seconds for spot instances.

   - -server.grpc.keepalive.max-connection-age=10s
   - -server.grpc.keepalive.max-connection-age-grace=5s
   - -server.grpc.keepalive.max-connection-idle=10s
   - -shutdown-delay=5s

This could be the cause here, in which case running on spot instances would not be viable.
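As a rough check on the timing, the gap between the SIGTERM log line and the first "module stopped" line can be computed from the two `ts` fields in the log above. The sketch below only does that arithmetic; the function name `shutdownGap` is made up for illustration, and reading the result as the effect of `-shutdown-delay=5s` is an assumption, not something confirmed.

```go
package main

import (
	"fmt"
	"time"
)

// shutdownGap returns the time between the "received SIGINT/SIGTERM"
// log line and the first "module stopped" log line, using the ts
// values copied from the log output in this issue.
func shutdownGap() (time.Duration, error) {
	sigterm, err := time.Parse(time.RFC3339Nano, "2025-02-24T00:47:24.455676056Z")
	if err != nil {
		return 0, err
	}
	firstStop, err := time.Parse(time.RFC3339Nano, "2025-02-24T00:47:29.463218373Z")
	if err != nil {
		return 0, err
	}
	return firstStop.Sub(sigterm), nil
}

func main() {
	gap, err := shutdownGap()
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(gap) // prints "5.007542317s"
}
```

The gap is about five seconds, and the first panic is logged at 00:47:29, the same second in which the distributor service stopped.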

The CPU drop across all pods is confusing, however, as I would expect Prometheus to just retry if a single distributor pod had terminated the connection unexpectedly.

How to reproduce it?

Unsure.

Possibly with the gRPC changes:

   - -server.grpc.keepalive.max-connection-age=10s
   - -server.grpc.keepalive.max-connection-age-grace=5s
   - -server.grpc.keepalive.max-connection-idle=10s
   - -shutdown-delay=5s

What did you think would happen?

Graceful shutdown of the distributor pod.

What was your environment?

Mimir 2.13
GKE 1.31.5-gke.1068000

Any additional context to share?

I could not see anything related in the release notes through to 2.15, but am happy to upgrade.
