
RetryStrategy doesn't apply to daemon container #2963

Closed · cy-zheng opened this issue May 6, 2020 · 11 comments · Fixed by #3008

@cy-zheng
Contributor

cy-zheng commented May 6, 2020

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:

When a worker node inside the k8s cluster crashed, the daemon container running on it could not restart.
The daemon container entered the "Error" state even though I had set up its retry strategy like this:

retryStrategy:
  limit: 10
  retryPolicy: Always
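
For context, here is a minimal sketch of how such a retryStrategy sits on a daemon template; the template and step names and the images are illustrative, not taken from my real workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-retry-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: start-daemon
        template: nginx-daemon
    - - name: use-daemon
        template: client
        arguments:
          parameters:
          - name: daemon-ip
            value: "{{steps.start-daemon.ip}}"   # pod IP of the daemon step
  - name: nginx-daemon
    daemon: true                  # marks this template's pod as a daemon container
    retryStrategy:                # the retry settings from this report
      limit: 10
      retryPolicy: Always
    container:
      image: nginx:1.17
      ports:
      - containerPort: 80
  - name: client
    inputs:
      parameters:
      - name: daemon-ip
    container:
      image: curlimages/curl:latest
      command: [sh, -c]
      args: ["curl {{inputs.parameters.daemon-ip}}"]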

What you expected to happen:

The Argo controller should recreate the daemon container's pod.

How to reproduce it (as minimally and precisely as possible):

Use kubectl delete to delete the pod of one of the running daemon containers.
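
For example (the namespace and pod name below are placeholders for your own cluster):

$ kubectl -n <namespace> get pods                       # find the pod created for the daemon step
$ kubectl -n <namespace> delete pod <daemon-pod-name>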

Anything else we need to know?:

In fact, I think we could design the "daemon container" feature better.
I guess one reason the daemon container cannot restart is that its pod IP has already been passed to the following DAG nodes. I think a better solution would be to bind a k8s Service object to the daemon container(s) and then pass the cluster IP of the Service to the following DAG nodes.
That could also enable load balancing between multiple daemon containers.
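
To illustrate the idea (the Service name and labels below are hypothetical; Argo does not create anything like this today), the bound Service might look like:

apiVersion: v1
kind: Service
metadata:
  name: my-daemon            # hypothetical Service fronting the daemon pod(s)
spec:
  selector:
    app: my-daemon           # would have to match labels set on the daemon pods
  ports:
  - port: 80
    targetPort: 80

Downstream DAG nodes would then receive the Service's cluster IP (or DNS name) instead of a pod IP, so a restarted daemon pod stays reachable.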

Environment:

  • Argo version:
    v2.7.5

  • Kubernetes version:
    v1.13.12


Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

simster7 self-assigned this May 6, 2020
@simster7
Member

simster7 commented May 6, 2020

Good suggestion, I'll look into this

@cy-zheng
Contributor Author

cy-zheng commented May 7, 2020

By the way, what do you think about managing daemon containers with k8s Deployment/StatefulSet objects? @simster7 I think daemon containers are designed for long-running background services, so some configurations like RetryStrategy.Limit are not very useful in such cases.

@simster7
Member

simster7 commented May 7, 2020

The limitation with using Deployments or StatefulSets is that the IP addresses of the Pods are not guaranteed to stay the same across restarts.

Such functionality would require Argo or the user to set up a Service in order to interact with them reliably.

@cy-zheng
Contributor Author

cy-zheng commented May 8, 2020

Yes. I think a good design would either use a Deployment or StatefulSet to provide the daemon container functionality, or implement the daemon container's failover logic on the Argo side. Either way, we need to bind the pods to a Service object to guarantee connectivity during pod failover.

By the way, are you (or other contributors) already working on this feature? Our team is evaluating Argo; if it isn't a priority for you, we may contribute a patch to Argo. @simster7

@simster7
Member

simster7 commented May 8, 2020

I'm looking into this feature myself as there are some tangential changes I want to make. I'll let you know if that changes. Thanks, @cy-zheng

@simster7
Member

Hi @cy-zheng! Sorry for the delay on this. We had some internal discussions about this feature and I did some testing and we decided that the best way to support this use case is for users to manually create a StatefulSet and a Service using the resource template.

The example below:

  1. Creates a Service and a StatefulSet in parallel using resource templates
  2. Waits for the StatefulSet to be ready using another resource template
  3. Tests that the StatefulSet is accessible by sending a curl request
  4. Tears down the StatefulSet and Service in an onExit handler. The onExit handler is used to ensure that the resources get deleted regardless of whether the Workflow fails and stops or is stopped manually with argo stop.

Keep in mind that for Argo to create the StatefulSet and the Service, it will need K8s RBAC permissions to do so. For testing on my machine, I just gave the ServiceAccount that Argo was using admin permissions:

$  kubectl -n argo create rolebinding delete-me-admin --clusterrole=admin --serviceaccount=argo:default

(This exact line might change depending on where Argo is installed and what service account it's configured to use)
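
If you prefer not to grant full admin, a narrower Role covering only the resources this workflow touches could look roughly like the following (the names are illustrative, and the namespace/ServiceAccount should match your install):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-statefulset-manager    # illustrative name
  namespace: argo
rules:
- apiGroups: [""]
  resources: ["services"]
  verbs: ["create", "get", "delete"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["create", "get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-statefulset-manager
  namespace: argo
subjects:
- kind: ServiceAccount
  name: default                         # the ServiceAccount the Workflow runs as
  namespace: argo
roleRef:
  kind: Role
  name: workflow-statefulset-manager
  apiGroup: rbac.authorization.k8s.io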

Here is the full Workflow. Let me know if you have any questions.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemoned-stateful-set-with-service-
spec:
  entrypoint: create-wait-and-test
  onExit: delete

  templates:
  - name: create-wait-and-test
    steps:
    - - name: create-service
        template: create-service
      - name: create-stateful-set
        template: create-stateful-set

    - - name: wait-stateful-set
        template: wait-stateful-set

    - - name: test
        template: test

  - name: delete    # This is called as an onExit handler
    steps:
      - - name: delete-service
          template: delete-service
        - name: delete-stateful-set
          template: delete-stateful-set

  - name: create-service
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: nginx
          labels:
            app: nginx
        spec:
          ports:
            - port: 80
              name: web
          clusterIP: None
          selector:
            app: nginx

  - name: create-stateful-set
    resource:
      action: create
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web
        spec:
          selector:
            matchLabels:
              app: nginx # has to match .spec.template.metadata.labels
          serviceName: "nginx"
          template:
            metadata:
              labels:
                app: nginx # has to match .spec.selector.matchLabels
            spec:
              terminationGracePeriodSeconds: 10
              containers:
                - name: nginx
                  image: k8s.gcr.io/nginx-slim:0.8
                  ports:
                    - containerPort: 80
                      name: web

  - name: wait-stateful-set
    resource:
      action: get
      successCondition: status.readyReplicas == 1
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web

  - name: test
    container:
      image: curlimages/curl:latest
      command: [sh, -c]
      args: ["curl nginx"]

  - name: delete-service
    resource:
      action: delete
      flags: ["--ignore-not-found"]
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: nginx
        
  - name: delete-stateful-set
    resource:
      action: delete
      flags: ["--ignore-not-found"]
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web
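
For reference, assuming the manifest above is saved as daemon-statefulset.yaml (the filename is arbitrary), it can be run with the Argo CLI:

$ argo submit -n argo --watch daemon-statefulset.yaml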

@simster7
Member

@cy-zheng Let me know if the solution works for you!

@cy-zheng
Contributor Author

Cool. It does work for me. My only question is:
Would you remove the "daemon container" feature in the future? It seems redundant if we start daemon services this way. Also, the RetryStrategy field inside a daemon container could confuse users. @simster7

@cy-zheng
Contributor Author

cy-zheng commented May 13, 2020

Hi @simster7, there are some drawbacks to the solution above.

  1. The Argo server cannot show the StatefulSet's pods on the web portal, so I cannot see information like logs and pod status through the Argo web portal.
  2. This information also won't be archived to MySQL/PostgreSQL.
  3. There is a risk of resource leaks when a user forgets to delete the StatefulSet/Service etc.; we have to inspect Kubernetes directly to find out what happened.

To sum up, this solution is a good workaround for our problem, but there is still something we can do to make Argo better. I think a good tool should hide these details from the user, so maybe an enhancement to the daemon container is the best choice.
Anyway, thanks for your kind solution!

@GuyPozner

Hi,

I agree with all of @cy-zheng's points, and I want to add another one:

  1. A daemon not restarting even though a retry policy was set on it is unexpected behavior, and it might only reveal itself very late.

Can you please reopen this issue?

Thanks.

@Haim-ai21

Question: what if we have a workflow with a daemon step (daemon: true) to keep a service up in the background, and another step with retries (not a daemon) that might crash?
Will the RetryStrategy work for the failed pod (the non-daemon one)?
From my tests the pod is not restarted and I see no retries for the failed pod, so the workflow ends with an error.
What am I missing here? A sketch of the setup I mean is below.
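
For reference, a minimal reconstruction of the setup being described (template names, images, and retry values are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-plus-retry-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: background-service
        template: service            # daemon: true, keeps running in the background
    - - name: flaky-step
        template: flaky              # non-daemon step with retries that may crash
  - name: service
    daemon: true
    container:
      image: nginx:1.17
      ports:
      - containerPort: 80
  - name: flaky
    retryStrategy:
      limit: 3
      retryPolicy: Always
    container:
      image: alpine:3.11
      command: [sh, -c]
      args: ["exit 1"]               # always fails, to exercise the retryStrategy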
