
Skaffold fails when a container's status code is 1 #5210

Open

Seb-C opened this issue Jan 7, 2021 · 17 comments
Labels
area/deploy, area/status-check, help wanted, kind/bug, priority/p3

Comments

Seb-C commented Jan 7, 2021

I have several pods that depend on MySQL to start.
Each of those pods implements the proper readinessProbe and livenessProbe to report the status of the pod.
If the application cannot start, the container exits with code 1, because a MySQL connection is required for the application to work.
Since Kubernetes has no built-in dependency management, the recommended way to handle this is to let the pod restart its containers until they succeed.
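A minimal sketch of one such deployment, relying on the default restart policy plus readiness/liveness probes (the names, image, port, and probe path are illustrative placeholders, not the actual manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      # Default restartPolicy (Always): if the container exits with code 1
      # because MySQL is not reachable yet, the kubelet restarts it until it
      # eventually starts successfully.
      containers:
        - name: service-a
          image: service-a:dev
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10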

Currently, when I run skaffold dev, it fails and deletes the resources immediately instead of waiting. If I do skaffold run, it also reports a failure, but 2–3 seconds later the pods created by Skaffold are running properly in the cluster.

Expected behavior

Skaffold should wait for Kubernetes to get the pod running and should not check individual containers' exit codes.
It should only stop and delete the resources if the pod is still failing after statusCheckDeadlineSeconds.

Actual behavior

Skaffold does not start dev mode and deletes the resources from the cluster.

 - deployment/mysql is ready. [2/3 deployment(s) still pending]
 - deployment/service-b: container service-b terminated with exit code 1
    - pod/service-b-59995f6f5f-bmg74: container service-b terminated with exit code 1
      > [service-b-59995f6f5f-bmg74 service-b] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-b failed. Error: container service-b terminated with exit code 1.
 - deployment/service-a: container service-a terminated with exit code 1
    - pod/service-a-7697f8bdd4-gg9rv: container service-a terminated with exit code 1
      > [service-a-7697f8bdd4-gg9rv service-a] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-a failed. Error: container service-a terminated with exit code 1.
 Cleaning up...
 - configmap "mysql" deleted
 - deployment.apps "mysql" deleted
 - service "mysql" deleted
 - deployment.apps "service-a" deleted
 - service "service-a" deleted
 - deployment.apps "service-b" deleted
 - service "service-b" deleted
exiting dev mode because first deploy failed: 2/3 deployment(s) failed

Information

  • Skaffold version: bleeding edge 35214eb
  • Operating system: Debian testing / minikube
  • Contents of skaffold.yaml:
apiVersion: skaffold/v2beta10
kind: Config
build:
  local:
    concurrency: 0
    useBuildkit: true
  artifacts:
    [confidential]
deploy:
  statusCheckDeadlineSeconds: 60
  kubectl:
    manifests:
      - ./build/kubernetes/*
test:
  - image: [confidential]
    structureTests:
      - './build/tests/*'

Steps to reproduce the behavior

Have a pod whose container exits with code 1 before MySQL starts, and then runs successfully.

Seb-C changed the title from "Skaffold fails when status code is not 0" to "Skaffold fails when a container's status code is not 0" on Jan 7, 2021
Seb-C changed the title from "Skaffold fails when a container's status code is not 0" to "Skaffold fails when a container's status code is 1" on Jan 7, 2021
nkubala added the area/deploy, area/status-check, kind/bug, and priority/p1 labels on Jan 8, 2021
nkubala (Contributor) commented Jan 8, 2021

@Seb-C this does sound like a bad experience. Are you saying that Skaffold immediately fails when one of the containers exits with error code 1, rather than waiting the allotted 60 seconds for the status check?

Would you be able to provide a small k8s manifest to reproduce this issue?

Seb-C (Author) commented Jan 14, 2021

@nkubala I am sorry, but I don't really have time to prepare an environment to demonstrate this. I think you can reproduce it with just a defined statusCheckDeadlineSeconds and a deployment whose container returns a non-zero exit code.
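A hypothetical minimal manifest along those lines (all names are illustrative): the container exits 1 on its first start and stays up after the kubelet restarts it, mimicking a service that fails until MySQL is reachable. Combined with statusCheckDeadlineSeconds: 60 in skaffold.yaml, this should show Skaffold reporting a failure on the first exit rather than waiting for the deadline.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flaky
spec:
  selector:
    matchLabels:
      app: flaky
  template:
    metadata:
      labels:
        app: flaky
    spec:
      containers:
        - name: flaky
          image: busybox
          # First start: the marker file is missing, so create it and exit 1.
          # After the kubelet restarts the container, the marker survives in
          # the pod-scoped emptyDir, so the container stays up.
          command: ["sh", "-c", "test -f /cache/ready || { touch /cache/ready; exit 1; }; sleep 3600"]
          volumeMounts:
            - name: cache
              mountPath: /cache
      volumes:
        - name: cache
          emptyDir: {}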

tejal29 (Contributor) commented Feb 23, 2021

This is expected: skaffold dev exits if the first deployment fails.
A container failing with exit code 1 is a non-recoverable error, and hence the status check phase fails.

This kind of relates to our idea of making the first dev loop not exit.

tejal29 (Contributor) commented Feb 23, 2021

duplicate of #4953

tejal29 closed this as completed on Feb 23, 2021
Seb-C (Author) commented Feb 24, 2021

@tejal29 I disagree with closing this ticket. This is a different issue from the first-dev-loop problem you linked.

> A container failing with exit code 1 is a non-recoverable error, and hence the status check phase fails.

A failing container is restarted by k8s until it works; that is the recommended way to handle this.

Currently, starting the first dev loop is impossible, independently of any code issue. Even if my pod is still in a starting state in k8s, Skaffold fails it.
With probes, it is possible to have a healthy pod that contains failed containers.

I expect Skaffold to handle my k8s objects, and Kubernetes to handle my containers. But currently Skaffold also performs checks at the container level, which short-circuits my probes.

This mishandling of responsibilities forces me to write code that behaves differently in Kubernetes (prod) and Skaffold (local).

tejal29 (Contributor) commented Feb 24, 2021

> I expect Skaffold to handle my k8s objects, and Kubernetes to handle my containers. But currently Skaffold also performs checks at the container level, which short-circuits my probes.

@Seb-C Can you please re-run with --status-check=false? Skaffold runs the status check by default; to opt out, you can use the above flag.
The disadvantage of disabling the status check is that, if you have the port-forward flag enabled or user-defined port forwards, Skaffold will attempt to forward those resources even if they are not ready.

Hope that answers your question.

tejal29 reopened this on Feb 24, 2021
tejal29 added the kind/question label and removed the kind/bug and priority/p1 labels on Feb 24, 2021
Seb-C (Author) commented Feb 25, 2021

@tejal29 Thank you for your answer.
I tried to reproduce the problem to test your solution, but my stack has changed a lot since I first opened this issue and I am now unable to reproduce it.
I believe that is either because I switched to Helm (which may handle the status checks differently) or because I updated Skaffold (I think an issue related to the status check timeout has been resolved).

Seb-C closed this as completed on Feb 25, 2021
casret commented Mar 24, 2021

@tejal29 I ran into this today when I was upgrading from 1.7.0. It's definitely a regression: 1.7 would wait until the deployment was stable, without caring about the containers' exit codes. I think that's the correct behavior from a k8s point of view, since containers should be considered ephemeral and restartable. Can you re-open this one, or would you rather I opened a new issue?

nkubala (Contributor) commented Apr 1, 2021

From @casret:

> I want to reopen this because it's a big change of behavior from the previous version and is blocking us from upgrading (from 1.7.0). If you could either revert the change in behavior or add a flag to allow the old behavior, that would be great. Just turning off the status check does not work, because in CI/CD we do want to check whether the deployment fully progressed, which it would if Skaffold didn't abort it prematurely.

nkubala reopened this on Apr 1, 2021
nkubala added the help wanted, kind/bug, and priority/p2 labels and removed the kind/question label on Apr 1, 2021
casret pushed a commit to casret/skaffold that referenced this issue May 5, 2021
Remove these errors from the unrecoverable list.  Container errors are
recoverable in a K8S environment, they may be waiting for another
resource to become stable e.g.
casret added a commit to casret/skaffold that referenced this issue May 5, 2021
Remove these errors from the unrecoverable list.  Container errors are
recoverable in a K8S environment, they may be waiting for another
resource to become stable e.g.
casret added a commit to casret/skaffold that referenced this issue May 5, 2021
Remove these from the unrecoverable errors list. Containers are
ephemeral in k8s, so errors in them may be recoverable at a system
level, e.g. when they are waiting for another resource to stabilize.
casret mentioned this issue on May 5, 2021
tejal29 (Contributor) commented Jun 4, 2021

closing this due to inactivity.

tejal29 closed this as completed on Jun 4, 2021
foobarbecue commented

I have this same issue, and it's a show-stopper for me. Please re-open.

nkubala reopened this on Aug 2, 2021
foobarbecue commented

Glad to see that it has been re-opened. Some more context for me: I use skaffold to set up https://github.com/argoproj/argo-workflows and that fails because argo-server and another pod normally fail a couple of times while mysql is starting up. I've managed to add an init container to the argo pods to make them wait for mysql as a workaround, but it would be much cleaner if I could disable skaffold's "panic on container exit" behavior.
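A sketch of that init-container workaround, assuming a MySQL Service named mysql listening on 3306 (the images and other names here are illustrative, not the exact argo manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-server
spec:
  selector:
    matchLabels:
      app: argo-server
  template:
    metadata:
      labels:
        app: argo-server
    spec:
      initContainers:
        - name: wait-for-mysql
          image: mysql:8.0   # reuses the client tools shipped in the server image
          # mysqladmin ping exits 0 as soon as the server answers (even if it
          # answers "access denied"), so this loop blocks pod startup until
          # MySQL is actually accepting connections.
          command: ["sh", "-c", "until mysqladmin ping -h mysql --silent; do sleep 2; done"]
      containers:
        - name: argo-server
          image: argo-server-image   # placeholder for the real argo-server image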

nkubala (Contributor) commented Aug 4, 2021

@foobarbecue thanks for the context. I think what we should probably do here is expose an allowPodRestart flag or something similar that backs off the status check when it sees a failure, to give the pods time to go through some restart cycles before actually calling the deployments failed.

briandealwis (Member) commented

@tejal29 suggested in #5790 (comment):

This behavior was added intentionally to quit the status check on exited containers, especially for the GCP Cloud Run integration.
With that said, I do understand your use case.

Maybe we could achieve this via statusCheck config:

statusCheck:
  waitOnContainerTerminated: true

and

This behavior was added for Cloud Run to make sure the status check terminates quickly.

This needs further exploration on:

  1. When to terminate quickly vs. when to wait.
  2. Should it be user-configurable?

gsquared94 added the priority/p3 label and removed the priority/p2 label on Oct 6, 2021
dallasvaughan commented

Has there been any update on enabling the previous (<1.7) behavior? We have a large number of applications that depend on the container restarting (usually just once) when a dependent DB takes a little longer to start up and become ready for connections.

Having Skaffold consider a deployment "failed" and quit before the specified k8s failureThreshold is reached, with no option to let the correct k8s behavior take precedence, seems like a pretty major regression.

aaron-prindle (Contributor) commented Dec 5, 2022

Recently this PR was merged: #8047

It adds the --tolerate-failures-until-deadline=[true|false] flag to Skaffold, as well as the skaffold.yaml config option below:

deploy:
  tolerateFailuresUntilDeadline: true #false is default

This has not been added to our docs site yet; there is an issue tracking that here: #8060

With the option enabled, Skaffold waits for all containers to succeed until the given statusCheckDeadlineSeconds deadline (vs. the default behaviour of failing as soon as a single deployment/container fails). This way, "flapping" deployments can be better supported for dev and CI/CD usage.

I believe the feature above should resolve this issue. I will wait until the docs issue/PR is closed and then close this.
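For reference, a hedged sketch of how this might combine with the statusCheckDeadlineSeconds from the original report; only the deploy fragment is shown, and the exact schema version and field placement should be confirmed against the docs once #8060 lands:

deploy:
  statusCheckDeadlineSeconds: 60        # overall status-check deadline
  tolerateFailuresUntilDeadline: true   # tolerate container failures until that deadline

For one-off runs, the --tolerate-failures-until-deadline flag mentioned above is the CLI equivalent.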

SuperCoolAlan commented

With --tolerate-failures-until-deadline=true, skaffold dev still seems to exit while the containers are retrying, and thus the hot-reload development feature is not available.
