
Skaffold fails when a container's status code is 1 #5210

Open

Seb-C opened this issue Jan 7, 2021 · 17 comments
Labels
area/deploy, area/status-check, help wanted, kind/bug, priority/p3

Comments

Seb-C commented Jan 7, 2021

I have several pods that depend on MySQL to start.
Each of those pods implements the proper readinessProbe and livenessProbe to report the status of the pod.
If the application cannot start, the container exits with code 1, because a MySQL connection is required for the application to work.
Since Kubernetes has no built-in dependency management, the recommended way to handle this is to let the pod restart its containers until they succeed.
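A minimal sketch of one such deployment, relying on the default restart policy plus readiness/liveness probes (the names, image, port, and probe path are illustrative placeholders, not the actual manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      # Default restartPolicy (Always): if the container exits with code 1
      # because MySQL is not reachable yet, the kubelet restarts it until it
      # eventually starts successfully.
      containers:
        - name: service-a
          image: service-a:dev
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10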

Currently, when I run skaffold dev, it fails and deletes the resources immediately instead of waiting. If I do skaffold run, it also reports a failure, but 2–3 seconds later the pods created by Skaffold are running properly in the cluster.

Expected behavior

Skaffold should wait for Kubernetes to get the pod running and should not check individual containers' exit codes.
It should only stop and delete the resources if the pod is still failing after statusCheckDeadlineSeconds.

Actual behavior

Skaffold does not start dev mode and deletes the resources from the cluster.

 - deployment/mysql is ready. [2/3 deployment(s) still pending]
 - deployment/service-b: container service-b terminated with exit code 1
    - pod/service-b-59995f6f5f-bmg74: container service-b terminated with exit code 1
      > [service-b-59995f6f5f-bmg74 service-b] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-b failed. Error: container service-b terminated with exit code 1.
 - deployment/service-a: container service-a terminated with exit code 1
    - pod/service-a-7697f8bdd4-gg9rv: container service-a terminated with exit code 1
      > [service-a-7697f8bdd4-gg9rv service-a] 2021/01/07 07:46:00 Failed to initialize the migration driver dial tcp 10.97.179.35:3306: connect: connection refused
 - deployment/service-a failed. Error: container service-a terminated with exit code 1.
 Cleaning up...
 - configmap "mysql" deleted
 - deployment.apps "mysql" deleted
 - service "mysql" deleted
 - deployment.apps "service-a" deleted
 - service "service-a" deleted
 - deployment.apps "service-b" deleted
 - service "service-b" deleted
exiting dev mode because first deploy failed: 2/3 deployment(s) failed

Information

  • Skaffold version: bleeding edge 35214eb
  • Operating system: Debian testing / minikube
  • Contents of skaffold.yaml:
apiVersion: skaffold/v2beta10
kind: Config
build:
  local:
    concurrency: 0
    useBuildkit: true
  artifacts:
    [confidential]
deploy:
  statusCheckDeadlineSeconds: 60
  kubectl:
    manifests:
      - ./build/kubernetes/*
test:
  - image: [confidential]
    structureTests:
      - './build/tests/*'

Steps to reproduce the behavior

Have a pod whose container exits with code 1 before MySQL starts, and then runs successfully.

Seb-C changed the title from "Skaffold fails when status code is not 0" to "Skaffold fails when a container's status code is not 0" on Jan 7, 2021
Seb-C changed the title from "Skaffold fails when a container's status code is not 0" to "Skaffold fails when a container's status code is 1" on Jan 7, 2021
nkubala added the area/deploy, area/status-check, kind/bug, and priority/p1 labels on Jan 8, 2021
nkubala (Contributor) commented Jan 8, 2021

@Seb-C this does sound like a bad experience. Are you saying that Skaffold immediately fails when one of the containers exits with error code 1, rather than waiting the allotted 60 seconds for the status check?

Would you be able to provide a small k8s manifest to reproduce this issue?

Seb-C (Author) commented Jan 14, 2021

@nkubala I am sorry, but I don't really have time to prepare an environment to demonstrate this. I think you can reproduce it with just a defined statusCheckDeadlineSeconds and a deployment whose container returns a non-zero exit code.
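A hypothetical minimal manifest along those lines (all names are illustrative): the container exits 1 on its first start and stays up after the kubelet restarts it, mimicking a service that fails until MySQL is reachable. Combined with statusCheckDeadlineSeconds: 60 in skaffold.yaml, this should show Skaffold reporting a failure on the first exit rather than waiting for the deadline.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flaky
spec:
  selector:
    matchLabels:
      app: flaky
  template:
    metadata:
      labels:
        app: flaky
    spec:
      containers:
        - name: flaky
          image: busybox
          # First start: the marker file is missing, so create it and exit 1.
          # After the kubelet restarts the container, the marker survives in
          # the pod-scoped emptyDir, so the container stays up.
          command: ["sh", "-c", "test -f /cache/ready || { touch /cache/ready; exit 1; }; sleep 3600"]
          volumeMounts:
            - name: cache
              mountPath: /cache
      volumes:
        - name: cache
          emptyDir: {}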

tejal29 (Contributor) commented Feb 23, 2021

This is expected: skaffold dev exits if the first deployment fails.
A container failing with exit code 1 is a non-recoverable error, and hence the status check phase fails.

This kind of relates to our idea of making the first dev loop not exit.

tejal29 (Contributor) commented Feb 23, 2021

duplicate of #4953

tejal29 closed this as completed on Feb 23, 2021
Seb-C (Author) commented Feb 24, 2021

@tejal29 I disagree with closing this ticket. This is a different issue from the first-dev-loop problem you linked.

> A container failing with exit code 1 is a non-recoverable error, and hence the status check phase fails.

A failing container is restarted by k8s until it works; that is the recommended way to handle this.

Currently, starting the first dev loop is impossible, independently of any code issue. Even if my pod is still in a starting state in k8s, Skaffold fails it.
With probes, it is possible to have a healthy pod that contains failed containers.

I expect Skaffold to handle my k8s objects, and Kubernetes to handle my containers. But currently Skaffold also performs checks at the container level, which short-circuits my probes.

This mishandling of responsibilities forces me to write code that behaves differently in Kubernetes (prod) and Skaffold (local).

tejal29 (Contributor) commented Feb 24, 2021

> I expect Skaffold to handle my k8s objects, and Kubernetes to handle my containers. But currently Skaffold also performs checks at the container level, which short-circuits my probes.

@Seb-C Can you please re-run with --status-check=false? Skaffold runs the status check by default; to opt out, you can use the above flag.
The disadvantage of disabling the status check is that, if you have the port-forward flag enabled or user-defined port forwards, Skaffold will attempt to forward those resources even if they are not ready.

Hope that answers your question.

tejal29 reopened this on Feb 24, 2021
tejal29 added the kind/question label and removed the kind/bug and priority/p1 labels on Feb 24, 2021
Seb-C (Author) commented Feb 25, 2021

@tejal29 Thank you for your answer.
I tried to reproduce the problem to test your solution, but my stack has changed a lot since I first opened this issue and I am now unable to reproduce it.
I believe that is either because I switched to Helm (which may handle the status checks differently) or because I updated Skaffold (I think an issue related to the status check timeout has been resolved).

Seb-C closed this as completed on Feb 25, 2021
casret commented Mar 24, 2021

@tejal29 I ran into this today when I was upgrading from 1.7.0. It's definitely a regression: 1.7 would wait until the deployment was stable, without caring about the containers' exit codes. I think that's the correct behavior from a k8s point of view, since containers should be considered ephemeral and restartable. Can you re-open this one, or would you rather I opened a new issue?

nkubala (Contributor) commented Apr 1, 2021

From @casret:

> I want to reopen this because it's a big change of behavior from the previous version and is blocking us from upgrading (from 1.7.0). If you could either revert the change in behavior or add a flag to allow the old behavior, that would be great. Just turning off the status check does not work, because in CI/CD we do want to check whether the deployment fully progressed, which it would if Skaffold didn't abort it prematurely.

nkubala reopened this on Apr 1, 2021
nkubala added the help wanted, kind/bug, and priority/p2 labels and removed the kind/question label on Apr 1, 2021
casret pushed a commit to casret/skaffold that referenced this issue May 5, 2021
Remove these errors from the unrecoverable list.  Container errors are
recoverable in a K8S environment, they may be waiting for another
resource to become stable e.g.
casret added a commit to casret/skaffold that referenced this issue May 5, 2021
Remove these errors from the unrecoverable list.  Container errors are
recoverable in a K8S environment, they may be waiting for another
resource to become stable e.g.
casret added a commit to casret/skaffold that referenced this issue May 5, 2021
Remove these from the unrecoverable errors list. Containers are
ephemeral in k8s, so errors in them may be recoverable at a system
level, e.g. when they are waiting for another resource to stabilize.
casret mentioned this issue on May 5, 2021
tejal29 (Contributor) commented Jun 4, 2021

closing this due to inactivity.

tejal29 closed this as completed on Jun 4, 2021
foobarbecue commented

I have this same issue, and it's a show-stopper for me. Please re-open.

nkubala reopened this on Aug 2, 2021
foobarbecue commented

Glad to see that it has been re-opened. Some more context for me: I use skaffold to set up https://github.com/argoproj/argo-workflows and that fails because argo-server and another pod normally fail a couple of times while mysql is starting up. I've managed to add an init container to the argo pods to make them wait for mysql as a workaround, but it would be much cleaner if I could disable skaffold's "panic on container exit" behavior.
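A sketch of that init-container workaround, assuming a MySQL Service named mysql listening on 3306 (the images and other names here are illustrative, not the exact argo manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-server
spec:
  selector:
    matchLabels:
      app: argo-server
  template:
    metadata:
      labels:
        app: argo-server
    spec:
      initContainers:
        - name: wait-for-mysql
          image: mysql:8.0   # reuses the client tools shipped in the server image
          # mysqladmin ping exits 0 as soon as the server answers (even if it
          # answers "access denied"), so this loop blocks pod startup until
          # MySQL is actually accepting connections.
          command: ["sh", "-c", "until mysqladmin ping -h mysql --silent; do sleep 2; done"]
      containers:
        - name: argo-server
          image: argo-server-image   # placeholder for the real argo-server image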

nkubala (Contributor) commented Aug 4, 2021

@foobarbecue thanks for the context. I think what we should probably do here is expose an allowPodRestart flag or something similar that backs off the status check when it sees a failure, to give the pods time to go through some restart cycles before actually calling the deployments failed.

briandealwis (Member) commented

@tejal29 suggested in #5790 (comment):

This behavior was added intentionally to quit the status check on exited containers, especially for the GCP Cloud Run integration.
With that said, I do understand your use case.

Maybe we could achieve this via statusCheck config:

statusCheck:
  waitOnContainerTerminated: true

and

This behavior was added for Cloud Run to make sure the status check terminates quickly.

This needs further exploration on:

  1. When to terminate quickly vs. when to wait.
  2. Should it be user-configurable?

gsquared94 added the priority/p3 label and removed the priority/p2 label on Oct 6, 2021
dallasvaughan commented

Has there been any update on enabling the previous (<1.7) behavior? We have a large number of applications that depend on the container restarting (usually just once) when a dependent DB takes a little longer to start up and become ready for connections.

Having Skaffold consider a deployment "failed" and quit before the specified k8s failureThreshold is reached, with no option to let the correct k8s behavior take precedence, seems like a pretty major regression.

aaron-prindle (Contributor) commented Dec 5, 2022

Recently this PR was merged: #8047

It adds the --tolerate-failures-until-deadline=[true|false] flag to Skaffold, as well as the skaffold.yaml config option below:

deploy:
  tolerateFailuresUntilDeadline: true #false is default

This has not been added to our docs site yet; there is an issue tracking that here: #8060

With the option enabled, Skaffold waits for all containers to succeed until the given statusCheckDeadlineSeconds deadline (vs. the default behaviour of failing as soon as a single deployment/container fails). This way, "flapping" deployments can be better supported for dev and CI/CD usage.

I believe the feature above should resolve this issue. I will wait until the docs issue/PR is closed and then close this.
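For reference, a hedged sketch of how this might combine with the statusCheckDeadlineSeconds from the original report; only the deploy fragment is shown, and the exact schema version and field placement should be confirmed against the docs once #8060 lands:

deploy:
  statusCheckDeadlineSeconds: 60        # overall status-check deadline
  tolerateFailuresUntilDeadline: true   # tolerate container failures until that deadline

For one-off runs, the --tolerate-failures-until-deadline flag mentioned above is the CLI equivalent.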

SuperCoolAlan commented

With --tolerate-failures-until-deadline=true, skaffold dev still seems to exit while the containers are retrying, and thus the hot-reload development feature is not available.
