Compactor fails with `concurrent map iteration and map write` #4480

**Describe the bug**
We observed a `compactor` pod fail due to `concurrent map iteration and map write`:

```
goroutine 597216 [running]:
runtime.throw(0x275e448, 0x26)
	/usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc002462b60 sp=0xc002462b30 pc=0x438652
runtime.mapiternext(0xc002462c70)
	/usr/local/go/src/runtime/map.go:858 +0x54c fp=0xc002462be0 sp=0xc002462b60 pc=0x410d2c
github.com/prometheus/client_golang/prometheus.(*constHistogram).Write(0xc001b9e400, 0xc0006fcd90, 0x2, 0x2)
	/__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/histogram.go:556 +0x179 fp=0xc002462ce0 sp=0xc002462be0 pc=0x880cb9
github.com/prometheus/client_golang/prometheus.processMetric(0x2bbe460, 0xc001b9e400, 0xc002463050, 0xc002463080, 0x0, 0x0, 0x1)
	/__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/registry.go:598 +0xa2 fp=0xc002462e08 sp=0xc002462ce0 pc=0x885f42
```

Full output: https://gist.github.com/siggy/8c0cd18a649c78f4cc28d16c19edbedc

**To Reproduce**
Steps to reproduce the behavior:
1. Start Cortex (`v1.10.0`)
2. Run compactor (this is the first time we've seen this after months of usage).

**Expected behavior**
The compactor should not fail with `concurrent map iteration and map write`.

**Environment:**
 - Infrastructure: AKS `v1.21.1`
 - Deployment tool: hand-rolled yaml, `compactor` as 3-instance StatefulSet

**Storage Engine**
- [x] Blocks
- [ ] Chunks

**Additional Context**

The log reported the error at `range h.buckets`, iterating over the `h.buckets` map in `client_golang`'s `constHistogram`:
https://github.com/cortexproject/cortex/blob/3b9f1c3f61809e5bc3e241608243dbe7d4a73135/vendor/github.com/prometheus/client_golang/prometheus/histogram.go#L549-L561

I don't immediately see where writes to `h.buckets` could be happening, but I'm guessing it's racy with Cortex's `HistogramData`, because it shares the `buckets` map with `constHistogram`:
https://github.com/cortexproject/cortex/blob/3b9f1c3f61809e5bc3e241608243dbe7d4a73135/pkg/util/metrics_helper.go#L495-L498
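To make the suspected sharing concrete, here is a minimal sketch. It is not Cortex or `client_golang` code: `histogramData`, `constHisto`, `metric`, and `sharesBuckets` are hypothetical stand-ins, and the only assumption is that the map is passed through without being copied. Since Go maps are reference types, a write on the producer side is then visible through the metric:

```go
package main

import "fmt"

// histogramData is a hypothetical stand-in for Cortex's HistogramData.
type histogramData struct {
	buckets map[float64]uint64
}

// constHisto is a hypothetical stand-in for client_golang's constHistogram.
type constHisto struct {
	buckets map[float64]uint64
}

// metric mimics handing d.buckets straight to the metric. No copy is made,
// so both structs point at the same underlying map.
func (d *histogramData) metric() *constHisto {
	return &constHisto{buckets: d.buckets}
}

// sharesBuckets reports whether a write made through the producer after the
// metric was created is visible through the metric.
func sharesBuckets() bool {
	d := &histogramData{buckets: map[float64]uint64{0.5: 1}}
	m := d.metric()
	d.buckets[1.0] = 7 // write through the producer side
	_, seen := m.buckets[1.0]
	return seen
}

func main() {
	fmt.Println("metric sees later writes:", sharesBuckets())
}
```

If the write and the `range` in `Write` happen on different goroutines without synchronization, this aliasing is exactly the situation the runtime check reports.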

One possible fix would be to clone the map in `HistogramData.Metric()` prior to passing it into `prometheus.MustNewConstHistogram`. If we went that route, there are similar places in the code that may also need this fix:
https://github.com/cortexproject/cortex/blob/3b9f1c3f61809e5bc3e241608243dbe7d4a73135/pkg/util/metrics_helper.go#L453-L455
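A rough sketch of the clone idea, assuming the bucket map has type `map[float64]uint64` as in the trace above. `cloneBuckets` is a hypothetical helper, not an existing Cortex function:

```go
package main

import "fmt"

// cloneBuckets returns a shallow copy of a bucket map, so the caller can hand
// the copy to a const histogram while the original keeps being updated.
// Hypothetical helper for illustration only.
func cloneBuckets(src map[float64]uint64) map[float64]uint64 {
	dst := make(map[float64]uint64, len(src))
	for upperBound, count := range src {
		dst[upperBound] = count
	}
	return dst
}

func main() {
	orig := map[float64]uint64{0.5: 1, 1: 3}
	snap := cloneBuckets(orig)

	orig[2.5] = 9 // later writes only touch the original

	fmt.Println("original buckets:", len(orig), "snapshot buckets:", len(snap))
}
```

One caveat: the clone itself iterates over the source map, so it only helps if it runs while writers are excluded (for example under the same lock that protects the writes). Go 1.21 and later also ship `maps.Clone` in the standard library, which does the same shallow copy.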

That said, I see some mutex protection already in place in `HistogramDataCollector`, so I'm not sure whether similar mutex protection would be a more holistic fix.
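For comparison, a sketch of what mutex protection could look like. This does not mirror how `HistogramDataCollector` is actually implemented: `guardedHistogram`, `add`, and `snapshot` are hypothetical names. Writers take the write lock, and readers copy the map under the read lock so nothing iterates the live map afterwards:

```go
package main

import (
	"fmt"
	"sync"
)

// guardedHistogram is a hypothetical type that owns its bucket map and never
// exposes it directly.
type guardedHistogram struct {
	mu      sync.RWMutex
	buckets map[float64]uint64
}

func newGuardedHistogram() *guardedHistogram {
	return &guardedHistogram{buckets: make(map[float64]uint64)}
}

// add increments a bucket while holding the write lock.
func (g *guardedHistogram) add(upperBound float64, count uint64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.buckets[upperBound] += count
}

// snapshot copies the buckets while holding the read lock. The returned map
// is private to the caller and safe to iterate without further locking.
func (g *guardedHistogram) snapshot() map[float64]uint64 {
	g.mu.RLock()
	defer g.mu.RUnlock()
	out := make(map[float64]uint64, len(g.buckets))
	for upperBound, count := range g.buckets {
		out[upperBound] = count
	}
	return out
}

func main() {
	g := newGuardedHistogram()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				g.add(1.0, 1)
				_ = g.snapshot() // concurrent readers never touch the live map
			}
		}()
	}
	wg.Wait()
	fmt.Println("count in bucket 1.0:", g.snapshot()[1.0])
}
```

Note that a lock alone is not enough if the live map is still handed to the const histogram, because `Write` iterates it later without holding the lock. Combining the lock with a copy, as above, covers both sides.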

Excerpt of `constHistogram.Write` from the `histogram.go` link above (the excerpt ends after the loop):

	func (h *constHistogram) Write(out *dto.Metric) error {
		his := &dto.Histogram{}
		buckets := make([]*dto.Bucket, 0, len(h.buckets))

		his.SampleCount = proto.Uint64(h.count)
		his.SampleSum = proto.Float64(h.sum)

		// The panic is raised here: this iteration races with a writer.
		for upperBound, count := range h.buckets {
			buckets = append(buckets, &dto.Bucket{
				CumulativeCount: proto.Uint64(count),
				UpperBound:      proto.Float64(upperBound),
			})
		}


`Metric()` on `HistogramData` and `SummaryData`, from the `metrics_helper.go` links above:

	// Return prometheus metric from this histogram data.
	func (d *HistogramData) Metric(desc *prometheus.Desc, labelValues ...string) prometheus.Metric {
		// d.buckets is passed through as-is, so the returned metric shares the map.
		return prometheus.MustNewConstHistogram(desc, d.sampleCount, d.sampleSum, d.buckets, labelValues...)
	}

	func (s *SummaryData) Metric(desc *prometheus.Desc, labelValues ...string) prometheus.Metric {
		// s.quantiles is passed through the same way.
		return prometheus.MustNewConstSummary(desc, s.sampleCount, s.sampleSum, s.quantiles, labelValues...)
	}
