Describe the bug
The ingester stops performing CAS operations against the ring KV store and hangs until the pod is restarted or replaced. This does not always happen, but when it does, it is caused by the KV store lifecycler calling ddbClient.QueryPagesWithContext. The client does not have a timeout, so the call hangs indefinitely if DynamoDB has a problem returning a response.
A suggested fix is to add a timeout configuration for DynamoDB, similar to the one for Consul (cortex/pkg/ring/kv/consul/client.go, line 45 in f74b4cd).
To Reproduce
Steps to reproduce the behavior:
- Deploy Cortex as multiple processes in a k8s cluster, with DynamoDB configured as the ring KV store.
- Run the ingester service as a k8s StatefulSet with 750 pods and trigger a rolling update.
- Observe that the counter metric dynamodb_kv_cas_attempt_total stops increasing for ingester X.
- Restart the pod for ingester X and observe that the counter metric dynamodb_kv_cas_attempt_total increases steadily again.
Expected behavior
The ingester heartbeat does not stop, and the dynamodb_kv_cas_attempt_total metric does not plateau.
Environment:
- Infrastructure: Kubernetes 1.28
- Deployment tool: Helm