Description
Describe the bug
We observed a compactor
pod fail due to concurrent map iteration and map write
:
goroutine 597216 [running]:
runtime.throw(0x275e448, 0x26)
/usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc002462b60 sp=0xc002462b30 pc=0x438652
runtime.mapiternext(0xc002462c70)
/usr/local/go/src/runtime/map.go:858 +0x54c fp=0xc002462be0 sp=0xc002462b60 pc=0x410d2c
github.com/prometheus/client_golang/prometheus.(*constHistogram).Write(0xc001b9e400, 0xc0006fcd90, 0x2, 0x2)
/__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/histogram.go:556 +0x179 fp=0xc002462ce0 sp=0xc002462be0 pc=0x880cb9
github.com/prometheus/client_golang/prometheus.processMetric(0x2bbe460, 0xc001b9e400, 0xc002463050, 0xc002463080, 0x0, 0x0, 0x1)
/__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/registry.go:598 +0xa2 fp=0xc002462e08 sp=0xc002462ce0 pc=0x885f42
Full output: https://gist.github.com/siggy/8c0cd18a649c78f4cc28d16c19edbedc
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (
v1.10.0
) - Run compactor (this is the first time we've seen this after months of usage).
Expected behavior
Do not fail with concurrent map iteration and map write
.
Environment:
- Infrastructure: AKS
v1.21.1
- Deployment tool: hand-rolled yaml,
compactor
as 3-instance StatefulSet
Storage Engine
- Blocks
- Chunks
Additional Context
The log reported the error at range h.buckets
, iterating over the h.buckets
map in client_golang
's constHistogram
:
cortex/vendor/github.com/prometheus/client_golang/prometheus/histogram.go
Lines 549 to 561 in 3b9f1c3
I don't immediately see where writes to h.buckets
could be happening, but I'm guessing it's racey with Cortex's HistogramData
, because it shares the buckets
map with constHistogram
:
cortex/pkg/util/metrics_helper.go
Lines 495 to 498 in 3b9f1c3
One possible fix would be to clone the map in HistogramData.Metric()
prior to passing it into prometheus.MustNewConstHistogram
. If we went that route, there are similar places in the code that may also need this fix:
cortex/pkg/util/metrics_helper.go
Lines 453 to 455 in 3b9f1c3
That said, I see some mutex protection already in-place in HistogramDataCollector
, so not sure whether similar mutex protection would be a more holistic fix.