Aggressively clean up failed deployments. (kubeflow#392)

* Aggressively clean up failed deployments.
* If a deployment is in an error state, we want to clean it up quickly rather than wait several hours.
* If deployments start failing because of quota issues, they will stack up and we may not recover. Aggressively deleting failed deployments should help.
* Related to kubeflow#391
* Fix lint.
Linchin · May 14, 2019 · 449485c
1 parent 2f7d5e1
Showing 1 changed file with 9 additions and 1 deletion.
diff --git a/py/kubeflow/testing/cleanup_ci.py b/py/kubeflow/testing/cleanup_ci.py
@@ -690,7 +690,15 @@ def cleanup_deployments(args): # pylint: disable=too-many-statements,too-many-br
     full_insert_time = d.get("insertTime")
     age = getAge(full_insert_time)
 
-    if age > datetime.timedelta(hours=args.max_age_hours):
+    if "error" in d.get("operation", {}):
+      # Prune failed deployments more aggressively
+      logging.info("Deployment %s is in error state %s",
+                   d.get("name"), d.get("operation").get("error"))
+      max_age = datetime.timedelta(minutes=10)
+    else:
+      max_age = datetime.timedelta(hours=args.max_age_hours)
+
+    if age > max_age:
       # Get the zone.
       if "update" in d:
         manifest_url = d["update"]["manifest"]
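
The threshold selection this change introduces can be sketched in isolation as follows. This is a minimal illustration, not the code in `cleanup_ci.py`: the helper names `max_age_for` and `is_expired`, the constant `FAILED_MAX_AGE`, and the sample deployment dicts are all hypothetical, and the age is passed in directly rather than computed from `insertTime` by `getAge`.

```python
import datetime

# Hypothetical constant; the commit hard-codes the 10 minutes inline.
FAILED_MAX_AGE = datetime.timedelta(minutes=10)


def max_age_for(deployment, max_age_hours):
    """Return the age past which a deployment should be deleted."""
    # `or {}` also covers a deployment whose "operation" is explicitly None.
    operation = deployment.get("operation") or {}
    if "error" in operation:
        # Failed deployments are pruned much sooner than healthy ones.
        return FAILED_MAX_AGE
    return datetime.timedelta(hours=max_age_hours)


def is_expired(deployment, age, max_age_hours):
    """True if the deployment is older than its applicable threshold."""
    return age > max_age_for(deployment, max_age_hours)


# Illustrative inputs only; real deployment records carry more fields.
failed = {"name": "kf-failed", "operation": {"error": {"errors": []}}}
healthy = {"name": "kf-ok", "operation": {"status": "DONE"}}
age = datetime.timedelta(minutes=30)

print(is_expired(failed, age, max_age_hours=2))   # True: 30 min > 10 min
print(is_expired(healthy, age, max_age_hours=2))  # False: 30 min < 2 h
```

The membership test `"error" in operation` behaves the same under Python 2 and Python 3, whereas `dict.has_key` exists only in Python 2.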