Distributors OOM on a single slow ingester in the cluster

Yesterday all distributors in one of our Cortex clusters were continuously `OOMKilled`. The root cause analysis showed that the issue was caused by a single ingester running on a failing Kubernetes node: the node was still up, but very slow.

This issue is due to how the quorum works. When the distributors receive a `Push()` request, the time series are sharded and then sent to 3 ingesters (we have a replication factor of `3`). The distributor's `Push()` request completes as soon as all series are pushed to at least 2 ingesters.

When one ingester is very slow, in-flight requests towards it pile up in the distributor, while the inbound `Push()` request completes as soon as the other two ingesters successfully complete the ingestion.

This causes the memory used by the distributors to increase due to the in-flight requests towards the slow ingester.

In a high-traffic Cortex cluster, distributors can hit their memory limit before the in-flight requests towards the slow ingester time out, causing all distributors to be `OOMKilled`. Restarted distributors will OOM again until the slow ingester is removed from the ring.

Distributors OOM on a single slow ingester in the cluster #1895
