
Ingester stops Ring CAS when lifecycler hangs on query to KV store #6211

@anna-tran


Describe the bug
The ingester stops performing CAS operations against the ring KV store and hangs until the pod is restarted or replaced. This does not always happen, but when it does, it is caused by the KV store lifecycler calling ddbClient.QueryPagesWithContext. The client has no timeout, so the call hangs indefinitely if DynamoDB fails to return a response.

A suggested fix is to add a timeout configuration for DynamoDB, similar to the one for Consul:

HTTPClientTimeout time.Duration `yaml:"http_client_timeout"`

To Reproduce
Steps to reproduce the behavior:

  1. Deploy Cortex as multiple processes in a k8s cluster, with DynamoDB configured as the ring KV store.
  2. Run the ingester service as a k8s StatefulSet with 750 pods and trigger a rolling update.
  3. Observe that the counter metric dynamodb_kv_cas_attempt_total stops increasing for ingester X.
  4. Restart the pod for ingester X and observe that dynamodb_kv_cas_attempt_total increases steadily again.

Expected behavior
The ingester heartbeat does not stop, and the dynamodb_kv_cas_attempt_total metric does not plateau.

Environment:

  • Infrastructure: Kubernetes 1.28
  • Deployment tool: Helm
