Describe the bug
We observed a compactor pod fail due to concurrent map iteration and map write:
goroutine 597216 [running]:
runtime.throw(0x275e448, 0x26)
/usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc002462b60 sp=0xc002462b30 pc=0x438652
runtime.mapiternext(0xc002462c70)
/usr/local/go/src/runtime/map.go:858 +0x54c fp=0xc002462be0 sp=0xc002462b60 pc=0x410d2c
github.com/prometheus/client_golang/prometheus.(*constHistogram).Write(0xc001b9e400, 0xc0006fcd90, 0x2, 0x2)
/__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/histogram.go:556 +0x179 fp=0xc002462ce0 sp=0xc002462be0 pc=0x880cb9
github.com/prometheus/client_golang/prometheus.processMetric(0x2bbe460, 0xc001b9e400, 0xc002463050, 0xc002463080, 0x0, 0x0, 0x1)
/__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/registry.go:598 +0xa2 fp=0xc002462e08 sp=0xc002462ce0 pc=0x885f42
Full output: https://gist.github.com/siggy/8c0cd18a649c78f4cc28d16c19edbedc
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (v1.10.0)
- Run compactor (this is the first time we've seen this after months of usage).
Expected behavior
Do not fail with concurrent map iteration and map write.
Environment:
- Infrastructure: AKS v1.21.1
- Deployment tool: hand-rolled yaml, compactor as 3-instance StatefulSet
Storage Engine
- Blocks
- Chunks
Additional Context
The log reported the error at range h.buckets, iterating over the h.buckets map in client_golang's constHistogram:
cortex/vendor/github.com/prometheus/client_golang/prometheus/histogram.go
Lines 549 to 561 in 3b9f1c3
func (h *constHistogram) Write(out *dto.Metric) error {
    his := &dto.Histogram{}
    buckets := make([]*dto.Bucket, 0, len(h.buckets))
    his.SampleCount = proto.Uint64(h.count)
    his.SampleSum = proto.Float64(h.sum)
    for upperBound, count := range h.buckets {
        buckets = append(buckets, &dto.Bucket{
            CumulativeCount: proto.Uint64(count),
            UpperBound:      proto.Float64(upperBound),
        })
    }
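For context on the failure mode itself: the Go runtime throws this exact non-recoverable fatal error whenever one goroutine ranges over a map while another goroutine writes to it. A minimal standalone sketch (not Cortex code, purely illustrative):

package main

// Two goroutines touching the same map, one iterating and one writing,
// will quickly crash with "fatal error: concurrent map iteration and map write".
func main() {
    buckets := map[float64]uint64{0.5: 1, 1: 2, 5: 3}

    go func() {
        for {
            buckets[10]++ // concurrent writer
        }
    }()

    for {
        for range buckets { // concurrent iteration, as in constHistogram.Write
        }
    }
}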
I don't immediately see where writes to h.buckets could be happening, but I'm guessing it's racy with Cortex's HistogramData, because it shares the buckets map with constHistogram:
cortex/pkg/util/metrics_helper.go
Lines 495 to 498 in 3b9f1c3
// Return prometheus metric from this histogram data.
func (d *HistogramData) Metric(desc *prometheus.Desc, labelValues ...string) prometheus.Metric {
    return prometheus.MustNewConstHistogram(desc, d.sampleCount, d.sampleSum, d.buckets, labelValues...)
}
One possible fix would be to clone the map in HistogramData.Metric() prior to passing it into prometheus.MustNewConstHistogram (rough sketch below). If we went that route, there are similar places in the code that may also need this fix:
cortex/pkg/util/metrics_helper.go
Lines 453 to 455 in 3b9f1c3
func (s *SummaryData) Metric(desc *prometheus.Desc, labelValues ...string) prometheus.Metric {
    return prometheus.MustNewConstSummary(desc, s.sampleCount, s.sampleSum, s.quantiles, labelValues...)
}
That said, I see some mutex protection already in place in HistogramDataCollector, so I'm not sure whether similar mutex protection would be a more holistic fix.
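If the mutex route is preferable, a hypothetical shape could look like the following (everything except Metric's signature is made up for illustration; I haven't checked how HistogramData is actually populated, and addBucket just stands in for whatever code path mutates the map):

// Hypothetical sketch: HistogramData guards its own state with a mutex, and
// Metric hands out a copy so nothing outside the lock touches the shared map.
type HistogramData struct {
    mu          sync.Mutex
    sampleCount uint64
    sampleSum   float64
    buckets     map[float64]uint64
}

// addBucket is a stand-in for whatever currently writes to d.buckets.
func (d *HistogramData) addBucket(upperBound float64, count uint64) {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.buckets == nil {
        d.buckets = map[float64]uint64{}
    }
    d.buckets[upperBound] += count
}

func (d *HistogramData) Metric(desc *prometheus.Desc, labelValues ...string) prometheus.Metric {
    d.mu.Lock()
    defer d.mu.Unlock()
    // Copy under the lock: the const histogram keeps the map and iterates it
    // later at gather time, after the lock has been released, so it must get
    // its own snapshot.
    buckets := make(map[float64]uint64, len(d.buckets))
    for upperBound, count := range d.buckets {
        buckets[upperBound] = count
    }
    return prometheus.MustNewConstHistogram(desc, d.sampleCount, d.sampleSum, buckets, labelValues...)
}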