Ingress controller reconciles all endpoints sequentially, causing delays that result in 502s #1103
Description
When a pod is deleted from k8s, the ingress controller receives an endpoints change event and reconciles all endpoints sequentially. As more ingress resources are added, the controller takes longer to catch up with the k8s cluster. This is very easy to reproduce because endpoints are reconciled in alphabetical order: updating the endpoints for the last service takes the longest.
In our case, we have 8 ingress resources, and it takes at least 15 seconds for the AWS ingress controller to reconcile with the k8s cluster. If the pod terminates before this reconciliation completes, the load balancer still routes requests to it, which results in 502s.
The workaround is to add a preStop delay to the pod; we added a 30-second delay to resolve this issue. However, this does not scale, because the required delay grows as more ingress resources are added to the cluster.
To improve efficiency, the controller should reconcile only the endpoints that changed. Failing that, endpoints could be reconciled in parallel.
There is already a TODO for this:
https://github.com/kubernetes-sigs/aws-alb-ingress-controller/blob/31aa413a6b63ebc4e30b400f94613fda485c0a2b/internal/ingress/controller/handlers/endpoints.go#L52