Description
Describe the bug
Recent memberlist changes are causing multiple pods of different types (ingesters, distributors, store-gateways, etc.) to be OOM killed at the same time.
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (SHA or version): e0807c4
- Perform Operations(Read/Write/Others)
We didn't find a way to consistently reproduce the problem, but we have a load test environment with 200 ingesters and distributors in which we started to see multiple pods getting OOM killed from time to time, especially during new deployments / rolling updates. Using this environment we noticed that the OOM issues started with this PR to upgrade dskit: #4601
Looking at memory metrics, we see an abrupt increase in memory usage before the pods are killed. Example:
Logs from the OOM-killed pods show the following kinds of warn/error messages shortly before the pods are killed/restarted:
level=warn ts=2022-03-05T21:34:10.655349368Z caller=memberlist_client.go:941 msg="failed to unmarshal received KV Pair" err="unexpected EOF"
ts=2022-03-05T21:34:10.660091926Z caller=memberlist_logger.go:74 level=error msg="msg type (183) not supported from=10.0.39.1:7946"
ts=2022-03-05T21:34:10.659977436Z caller=memberlist_logger.go:74 level=error msg="msg type (157) not supported from=10.0.39.1:7946"
ts=2022-03-05T21:34:10.660058335Z caller=memberlist_logger.go:74 level=error msg="msg type (228) not supported from=10.0.39.1:7946"
ts=2022-03-05T21:34:10.660077938Z caller=memberlist_logger.go:74 level=error msg="msg type (12) not supported fr
level=warn ts=2022-03-05T15:21:01.478446259Z caller=memberlist_client.go:941 msg="failed to unmarshal received KV Pair" err="unexpected EOF"
ts=2022-03-05T15:21:01.477712759Z caller=memberlist_logger.go:74 level=error msg="Failed to decode nack response: codec.decoder: Only encoded map or array can be decoded into a struct. (valueType: 2) from=10.0.24.213:7946"
ts=2022-03-05T15:21:08.087633907Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to ingester-122-a8baa05b but other probes failed, network may be misconfigured"
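For anyone triaging similar logs, a small script can help check whether the "msg type (N) not supported" errors all originate from the same peer, as they appear to in the excerpt above. The following is only a sketch: the function name `tally_unsupported` and the regular expression are illustrative assumptions based on the log lines in this report, not part of Cortex, dskit, or memberlist.

```python
import re
from collections import Counter

# Assumption: the error text has the shape seen in this report,
# i.e. "msg type (N) not supported from=IP:PORT".
UNSUPPORTED_RE = re.compile(r"msg type \((\d+)\) not supported from=([0-9.]+:\d+)")


def tally_unsupported(lines):
    """Count unsupported-message-type errors per sending peer.

    Returns (counts, types): counts maps peer -> number of errors,
    types maps peer -> set of message type numbers seen from it.
    """
    counts = Counter()
    types = {}
    for line in lines:
        match = UNSUPPORTED_RE.search(line)
        if match is None:
            # Unrelated message, or a truncated line with no peer address.
            continue
        msg_type, peer = int(match.group(1)), match.group(2)
        counts[peer] += 1
        types.setdefault(peer, set()).add(msg_type)
    return counts, types


# Lines copied from the logs in this report (the last one is truncated).
SAMPLE = [
    'ts=2022-03-05T21:34:10.660091926Z caller=memberlist_logger.go:74 level=error msg="msg type (183) not supported from=10.0.39.1:7946"',
    'ts=2022-03-05T21:34:10.659977436Z caller=memberlist_logger.go:74 level=error msg="msg type (157) not supported from=10.0.39.1:7946"',
    'ts=2022-03-05T21:34:10.660058335Z caller=memberlist_logger.go:74 level=error msg="msg type (228) not supported from=10.0.39.1:7946"',
    'ts=2022-03-05T21:34:10.660077938Z caller=memberlist_logger.go:74 level=error msg="msg type (12) not supported fr',
]

if __name__ == "__main__":
    counts, types = tally_unsupported(SAMPLE)
    for peer, n in counts.most_common():
        print(peer, n, sorted(types[peer]))
```

Running it over the full log of an affected pod shows whether the errors are concentrated on a few senders or spread across the cluster.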
Expected behavior
Pods should not be OOM killed when there is no increase in traffic or in the number of series received.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: helm
Storage Engine
- Blocks
- Chunks
Additional Context