After a large event kicked off hundreds of workflows several weeks ago, our prod cluster has not been able to scale back down. Both workflow and event pods block cluster scale-down in GKE with the warnings "Pod is blocking scale down because it has local storage" and "Pod is blocking scale down because it's not backed by a controller":
What is the best practice here? Both of these warnings suggest adding a safe-to-evict annotation; is this safe to add?
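For workflow pods, one way to attach the annotation is through the workflow spec itself. The sketch below assumes Argo Workflows' `spec.podMetadata` field (which propagates metadata to all pods the workflow creates); the annotation key is the standard one the GKE cluster autoscaler reads:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: my-workflow-
spec:
  # Applied to every pod this workflow creates
  podMetadata:
    annotations:
      # Tells the cluster autoscaler it may evict these pods
      # even though they use local storage / lack a controller
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [echo, hello]
```

Note the trade-off: with this annotation the autoscaler may evict a mid-run workflow pod during scale-down, so it is only "safe" if your workflows tolerate retries.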
Worth noting that both CPU and memory utilisation are low:
Additionally, we've implemented pod disruption budgets to reduce the chance of voluntary disruption of workflow pods. In the meantime we are investigating internally whether this could be one factor blocking the scale-down after a surge.
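A strict PDB is itself a plausible culprit: the cluster autoscaler performs voluntary evictions during scale-down, so a budget that allows zero disruptions will pin pods (and their nodes) in place. A hypothetical example of such a budget, assuming workflow pods carry the `workflows.argoproj.io/workflow` label that Argo sets:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: workflow-pdb
spec:
  # maxUnavailable: 0 (or minAvailable: 100%) forbids ALL voluntary
  # evictions, including the autoscaler's scale-down evictions
  maxUnavailable: 0
  selector:
    matchLabels:
      workflows.argoproj.io/workflow: my-workflow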
Diagnostics
This can be reproduced by kicking off 100+ workflows that each sleep for 1000+ seconds.
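A minimal sleeping workflow for the reproduction could look like this (the image and timings are illustrative, not from the original report):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleep-repro-
spec:
  entrypoint: sleep
  templates:
    - name: sleep
      container:
        image: alpine:3.18
        command: [sh, -c]
        # Long enough for the autoscaler to attempt scale-down
        args: ["sleep 1000"]
```

Submitting it 100+ times in a loop (e.g. `for i in $(seq 100); do argo submit sleep-repro.yaml; done`) should surge the node pool and then leave the cluster stuck once the pods finish.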
We see ~13k logs / hour on local storage scale down issues:
And ~100 logs / hour on the no controller scale down issue:
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
Hello, no one answered the question though. GKE needs pods to be managed by a controller (Deployment, StatefulSet, etc.), or else to carry the safe-to-evict annotation, before it will scale down the number of nodes in the cluster. Is this Argo "orchestrator" pod safe to evict? (I don't know the implementation details, but if it retries until success without side effects, I'd consider it "safe" enough to be evicted on scale-downs.)