
RetryStrategy doesn't apply to daemon container #2963

Closed · cy-zheng opened this issue May 6, 2020 · 11 comments · Fixed by #3008

@cy-zheng
Contributor

cy-zheng commented May 6, 2020

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:

When a worker node inside the k8s cluster crashed, the daemon container running on it could not restart.
The daemon container entered the "Error" state even though I had set up its retry strategy like this:

retryStrategy:
  limit: 10
  retryPolicy: Always
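
For context, here is a minimal sketch of how such a retryStrategy sits on a daemon template; the template and step names and the images are illustrative, not taken from my real workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-retry-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: start-daemon
        template: nginx-daemon
    - - name: use-daemon
        template: client
        arguments:
          parameters:
          - name: daemon-ip
            value: "{{steps.start-daemon.ip}}"   # pod IP of the daemon step
  - name: nginx-daemon
    daemon: true                  # marks this template's pod as a daemon container
    retryStrategy:                # the retry settings from this report
      limit: 10
      retryPolicy: Always
    container:
      image: nginx:1.17
      ports:
      - containerPort: 80
  - name: client
    inputs:
      parameters:
      - name: daemon-ip
    container:
      image: curlimages/curl:latest
      command: [sh, -c]
      args: ["curl {{inputs.parameters.daemon-ip}}"]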

What you expected to happen:

The Argo controller should recreate the daemon container's pod.

How to reproduce it (as minimally and precisely as possible):

Use kubectl delete to delete the pod of one of the running daemon containers.
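
For example (the namespace and pod name below are placeholders for your own cluster):

$ kubectl -n <namespace> get pods                       # find the pod created for the daemon step
$ kubectl -n <namespace> delete pod <daemon-pod-name>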

Anything else we need to know?:

In fact, I think we could design the "daemon container" feature better.
I guess one reason the daemon container cannot restart is that its pod IP has already been passed to the following DAG nodes. I think a better solution would be to bind a k8s Service object to the daemon container(s) and then pass the cluster IP of the Service to the following DAG nodes.
That could also enable load balancing between multiple daemon containers.
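
To illustrate the idea (the Service name and labels below are hypothetical; Argo does not create anything like this today), the bound Service might look like:

apiVersion: v1
kind: Service
metadata:
  name: my-daemon            # hypothetical Service fronting the daemon pod(s)
spec:
  selector:
    app: my-daemon           # would have to match labels set on the daemon pods
  ports:
  - port: 80
    targetPort: 80

Downstream DAG nodes would then receive the Service's cluster IP (or DNS name) instead of a pod IP, so a restarted daemon pod stays reachable.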

Environment:

  • Argo version:
    v2.7.5

  • Kubernetes version:
    v1.13.12


Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

simster7 self-assigned this May 6, 2020
@simster7
Member

simster7 commented May 6, 2020

Good suggestion, I'll look into this

@cy-zheng
Contributor Author

cy-zheng commented May 7, 2020

By the way, what do you think about managing daemon containers with k8s Deployment/StatefulSet objects? @simster7 I think daemon containers are designed for long-running background services, so some configurations like RetryStrategy.Limit are not very useful in such cases.

@simster7
Member

simster7 commented May 7, 2020

The limitation with using Deployments or StatefulSets is that the IP addresses of the Pods are not guaranteed to stay the same across restarts.

Such functionality would require Argo or the user to set up a Service in order to interact with them reliably.

@cy-zheng
Contributor Author

cy-zheng commented May 8, 2020

Yes. I think a good design would either use a Deployment or StatefulSet to provide the daemon container functionality, or implement the daemon container's failover logic on the Argo side. Either way, we need to bind the pods to a Service object to guarantee connectivity during pod failover.

By the way, are you (or other contributors) already working on this feature? Our team is evaluating Argo; if it isn't a priority for you, we may contribute a patch to Argo. @simster7

@simster7
Member

simster7 commented May 8, 2020

I'm looking into this feature myself as there are some tangential changes I want to make. I'll let you know if that changes. Thanks, @cy-zheng

@simster7
Member

Hi @cy-zheng! Sorry for the delay on this. We had some internal discussions about this feature and I did some testing and we decided that the best way to support this use case is for users to manually create a StatefulSet and a Service using the resource template.

The example below:

  1. Creates a Service and a StatefulSet in parallel using resource templates
  2. Waits for the StatefulSet to be ready using another resource template
  3. Tests that the StatefulSet is accessible by sending a curl request
  4. Tears down the StatefulSet and Service in an onExit handler. The onExit handler is used to ensure that the resources get deleted regardless of whether the Workflow fails and stops or is stopped manually with argo stop.

Keep in mind that for Argo to create the StatefulSet and the Service, it will need K8s RBAC permissions to do so. For testing on my machine, I just gave the ServiceAccount that Argo was using admin permissions:

$  kubectl -n argo create rolebinding delete-me-admin --clusterrole=admin --serviceaccount=argo:default

(This exact line might change depending on where Argo is installed and what service account it's configured to use)
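
If you prefer not to grant full admin, a narrower Role covering only the resources this workflow touches could look roughly like the following (the names are illustrative, and the namespace/ServiceAccount should match your install):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-statefulset-manager    # illustrative name
  namespace: argo
rules:
- apiGroups: [""]
  resources: ["services"]
  verbs: ["create", "get", "delete"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["create", "get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-statefulset-manager
  namespace: argo
subjects:
- kind: ServiceAccount
  name: default                         # the ServiceAccount the Workflow runs as
  namespace: argo
roleRef:
  kind: Role
  name: workflow-statefulset-manager
  apiGroup: rbac.authorization.k8s.io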

Here is the full Workflow. Let me know if you have any questions.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemoned-stateful-set-with-service-
spec:
  entrypoint: create-wait-and-test
  onExit: delete

  templates:
  - name: create-wait-and-test
    steps:
    - - name: create-service
        template: create-service
      - name: create-stateful-set
        template: create-stateful-set

    - - name: wait-stateful-set
        template: wait-stateful-set

    - - name: test
        template: test

  - name: delete    # This is called as an onExit handler
    steps:
      - - name: delete-service
          template: delete-service
        - name: delete-stateful-set
          template: delete-stateful-set

  - name: create-service
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: nginx
          labels:
            app: nginx
        spec:
          ports:
            - port: 80
              name: web
          clusterIP: None
          selector:
            app: nginx

  - name: create-stateful-set
    resource:
      action: create
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web
        spec:
          selector:
            matchLabels:
              app: nginx # has to match .spec.template.metadata.labels
          serviceName: "nginx"
          template:
            metadata:
              labels:
                app: nginx # has to match .spec.selector.matchLabels
            spec:
              terminationGracePeriodSeconds: 10
              containers:
                - name: nginx
                  image: k8s.gcr.io/nginx-slim:0.8
                  ports:
                    - containerPort: 80
                      name: web

  - name: wait-stateful-set
    resource:
      action: get
      successCondition: status.readyReplicas == 1
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web

  - name: test
    container:
      image: curlimages/curl:latest
      command: [sh, -c]
      args: ["curl nginx"]

  - name: delete-service
    resource:
      action: delete
      flags: ["--ignore-not-found"]
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: nginx
        
  - name: delete-stateful-set
    resource:
      action: delete
      flags: ["--ignore-not-found"]
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web
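
For reference, assuming the manifest above is saved as daemon-statefulset.yaml (the filename is arbitrary), it can be run with the Argo CLI:

$ argo submit -n argo --watch daemon-statefulset.yaml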

@simster7
Member

@cy-zheng Let me know if the solution works for you!

@cy-zheng
Contributor Author

Cool. It does work for me. My only question is:
Would you remove the "daemon container" feature in the future? It seems redundant if we start daemon services this way. Also, the RetryStrategy field inside a daemon container could confuse users. @simster7

@cy-zheng
Contributor Author

cy-zheng commented May 13, 2020

Hi @simster7, there are some drawbacks to the solution above.

  1. The Argo server cannot show the StatefulSet's pods on the web portal, so I cannot see information like logs and pod status through the Argo web portal.
  2. This information also won't be archived to MySQL/PostgreSQL.
  3. There is a risk of resource leaks when a user forgets to delete the StatefulSet/Service etc.; we have to inspect Kubernetes directly to find out what happened.

To sum up, this solution is a good workaround for our problem, but there is still something we can do to make Argo better. I think a good tool should hide these details from the user, so maybe an enhancement to the daemon container is the best choice.
Anyway, thanks for your kind solution!

@GuyPozner

Hi,

I agree with all of @cy-zheng's points, and I want to add another one:

  1. A daemon not restarting even though a retry policy was set on it is unexpected behavior, and it might only reveal itself very late.

Can you please reopen this issue?

Thanks.

@Haim-ai21

Question: what if we have a workflow with a daemon step (daemon: true) to keep a service up in the background, and another step with retries (not a daemon) that might crash?
Will the RetryStrategy work for the failed pod (the non-daemon one)?
From my tests the pod is not restarted and I see no retries for the failed pod, so the workflow ends with an error.
What am I missing here? A sketch of the setup I mean is below.
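
For reference, a minimal reconstruction of the setup being described (template names, images, and retry values are illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-plus-retry-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: background-service
        template: service            # daemon: true, keeps running in the background
    - - name: flaky-step
        template: flaky              # non-daemon step with retries that may crash
  - name: service
    daemon: true
    container:
      image: nginx:1.17
      ports:
      - containerPort: 80
  - name: flaky
    retryStrategy:
      limit: 3
      retryPolicy: Always
    container:
      image: alpine:3.11
      command: [sh, -c]
      args: ["exit 1"]               # always fails, to exercise the retryStrategy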
