Aggressively cleanup failed deployments. (kubeflow#392)
* Aggressively cleanup failed deployments.

* If a deployment is in an error state, we want to clean it up quickly
  rather than wait several hours.

  * The problem is that if deployments start failing because of quota
    issues, they will stack up and we may not recover. Aggressively
    deleting failed deployments should help.

* Related to kubeflow#391

* Fix lint.
jlewi authored and k8s-ci-robot committed May 14, 2019
1 parent 2f7d5e1 commit 449485c
Showing 1 changed file with 9 additions and 1 deletion.
10 changes: 9 additions & 1 deletion py/kubeflow/testing/cleanup_ci.py
@@ -690,7 +690,15 @@ def cleanup_deployments(args): # pylint: disable=too-many-statements,too-many-br
 full_insert_time = d.get("insertTime")
 age = getAge(full_insert_time)

-if age > datetime.timedelta(hours=args.max_age_hours):
+if d.get("operation", {}).has_key("error"):
+  # Prune failed deployments more aggressively
+  logging.info("Deployment %s is in error state %s",
+               d.get("name"), d.get("operation").get("error"))
+  max_age = datetime.timedelta(minutes=10)
+else:
+  max_age = datetime.timedelta(hours=args.max_age_hours)
+
+if age > max_age:
   # Get the zone.
   if "update" in d:
     manifest_url = d["update"]["manifest"]
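
The threshold selection in this change can be sketched on its own as below. This is a minimal illustration, not the code in cleanup_ci.py: the names FAILED_MAX_AGE, max_age_for and is_expired are hypothetical, and the sample dictionaries only mimic the "operation" and "error" keys that the diff reads. Note that dict.has_key() exists only in Python 2; the sketch uses the `in` operator, which behaves the same way in Python 2 and 3.

```python
import datetime

# Hypothetical constant; the commit hard-codes the 10 minutes inline.
FAILED_MAX_AGE = datetime.timedelta(minutes=10)


def max_age_for(deployment, max_age_hours):
  """Return how old a deployment may get before it is eligible for cleanup."""
  # Treat a missing or None "operation" entry as "no error reported".
  operation = deployment.get("operation") or {}
  if "error" in operation:
    # Failed deployments are pruned after minutes rather than hours.
    return FAILED_MAX_AGE
  return datetime.timedelta(hours=max_age_hours)


def is_expired(deployment, age, max_age_hours):
  """Return True if the deployment is older than its allowed maximum age."""
  return age > max_age_for(deployment, max_age_hours)


# Illustrative data only; real deployment records carry many more fields.
failed = {"name": "d1", "operation": {"error": {"errors": []}}}
healthy = {"name": "d2", "operation": {}}
age = datetime.timedelta(minutes=30)

print(is_expired(failed, age, 3))   # True: 30 minutes exceeds the 10-minute limit
print(is_expired(healthy, age, 3))  # False: 30 minutes is under the 3-hour limit
```

Keeping the decision in one small function makes the two thresholds easy to unit test without calling any cloud API.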