
Compactor fails with concurrent map iteration and map write #4480

Closed
@siggy

Description


Describe the bug
We observed a compactor pod crash due to concurrent map iteration and map write:

goroutine 597216 [running]:
runtime.throw(0x275e448, 0x26)
	/usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc002462b60 sp=0xc002462b30 pc=0x438652
runtime.mapiternext(0xc002462c70)
	/usr/local/go/src/runtime/map.go:858 +0x54c fp=0xc002462be0 sp=0xc002462b60 pc=0x410d2c
github.com/prometheus/client_golang/prometheus.(*constHistogram).Write(0xc001b9e400, 0xc0006fcd90, 0x2, 0x2)
	/__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/histogram.go:556 +0x179 fp=0xc002462ce0 sp=0xc002462be0 pc=0x880cb9
github.com/prometheus/client_golang/prometheus.processMetric(0x2bbe460, 0xc001b9e400, 0xc002463050, 0xc002463080, 0x0, 0x0, 0x1)
	/__w/cortex/cortex/vendor/github.com/prometheus/client_golang/prometheus/registry.go:598 +0xa2 fp=0xc002462e08 sp=0xc002462ce0 pc=0x885f42

Full output: https://gist.github.com/siggy/8c0cd18a649c78f4cc28d16c19edbedc

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex (v1.10.0)
  2. Run the compactor (this is the first time we've seen this after months of usage).

Expected behavior
Do not fail with concurrent map iteration and map write.

Environment:

  • Infrastructure: AKS v1.21.1
  • Deployment tool: hand-rolled yaml, compactor as 3-instance StatefulSet

Storage Engine

  • Blocks
  • Chunks

Additional Context

The stack trace points at the range over h.buckets in client_golang's constHistogram.Write, i.e. the loop that iterates the h.buckets map:

func (h *constHistogram) Write(out *dto.Metric) error {
	his := &dto.Histogram{}
	buckets := make([]*dto.Bucket, 0, len(h.buckets))
	his.SampleCount = proto.Uint64(h.count)
	his.SampleSum = proto.Float64(h.sum)
	for upperBound, count := range h.buckets {
		buckets = append(buckets, &dto.Bucket{
			CumulativeCount: proto.Uint64(count),
			UpperBound:      proto.Float64(upperBound),
		})
	}

I don't immediately see where writes to h.buckets could be happening, but I'm guessing there is a race with Cortex's HistogramData, because it shares its buckets map with the constHistogram:

// Return prometheus metric from this histogram data.
func (d *HistogramData) Metric(desc *prometheus.Desc, labelValues ...string) prometheus.Metric {
	return prometheus.MustNewConstHistogram(desc, d.sampleCount, d.sampleSum, d.buckets, labelValues...)
}
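
Even without pinpointing the writer, the sharing alone is enough to explain the crash: a const histogram keeps a reference to the map it is handed, so any later write to that map races with its Write() method. Below is a minimal standalone sketch (not Cortex code; the metric name and values are made up) that reproduces the same class of failure:

package main

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	// Shared buckets map, playing the role of HistogramData.buckets.
	buckets := map[float64]uint64{1: 10, 5: 20, 10: 30}

	desc := prometheus.NewDesc("demo_histogram", "demo", nil, nil)

	// The const histogram stores a reference to `buckets`, not a copy.
	m := prometheus.MustNewConstHistogram(desc, 60, 123.4, buckets)

	var wg sync.WaitGroup
	wg.Add(2)

	// Writer: keeps mutating the shared map.
	go func() {
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			buckets[5]++
		}
	}()

	// Reader: Write() ranges over the very same map (histogram.go:556 in the trace).
	go func() {
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			var out dto.Metric
			_ = m.Write(&out)
		}
	}()

	wg.Wait()
}

Running it with go run -race flags the conflicting accesses; without -race it typically dies with the same fatal error: concurrent map iteration and map write.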

One possible fix would be to clone the map in HistogramData.Metric() prior to passing it into prometheus.MustNewConstHistogram. If we went that route, there are similar places in the code that may also need this fix:

func (s *SummaryData) Metric(desc *prometheus.Desc, labelValues ...string) prometheus.Metric {
	return prometheus.MustNewConstSummary(desc, s.sampleCount, s.sampleSum, s.quantiles, labelValues...)
}
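
For illustration, the clone approach could look roughly like this (a sketch only, not a tested patch, assuming Metric() is the right place to take the copy):

// Return prometheus metric from this histogram data. The buckets map is
// copied so the returned const histogram does not share state with this
// HistogramData, which may still be updated concurrently.
func (d *HistogramData) Metric(desc *prometheus.Desc, labelValues ...string) prometheus.Metric {
	buckets := make(map[float64]uint64, len(d.buckets))
	for upperBound, count := range d.buckets {
		buckets[upperBound] = count
	}
	return prometheus.MustNewConstHistogram(desc, d.sampleCount, d.sampleSum, buckets, labelValues...)
}

// SummaryData.Metric would need the same treatment for its quantiles map.
func (s *SummaryData) Metric(desc *prometheus.Desc, labelValues ...string) prometheus.Metric {
	quantiles := make(map[float64]float64, len(s.quantiles))
	for q, v := range s.quantiles {
		quantiles[q] = v
	}
	return prometheus.MustNewConstSummary(desc, s.sampleCount, s.sampleSum, quantiles, labelValues...)
}

The copy would still need to happen under whatever lock guards the underlying maps, otherwise the copy loop races the same way the iteration in Write() does.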

That said, I see some mutex protection already in place in HistogramDataCollector, so I'm not sure whether similar mutex protection would be a more holistic fix.
