RetryStrategy doesn't apply to daemon container #2963
Good suggestion, I'll look into this.
By the way, what do you think about managing daemon containers with k8s Deployment/StatefulSet objects? @simster7 I think daemon containers are designed for long-running background services.
Our limitation with using Deployments or StatefulSets is that the IP addresses of the Pods are not guaranteed to stay the same across restarts. Supporting that would require Argo or the user to set up a Service in order to interact with them reliably.
Yes. I think a good design would be to use a Deployment or StatefulSet to provide the daemon container's function, or to implement failover logic for daemon containers on the Argo side. Eventually, we need to bind pods to a Service object to guarantee connectivity during pod failover. By the way, are you (or other contributors) already working on this feature? Our team is evaluating Argo; if you don't have priority on it, we may contribute a patch. @simster7
I'm looking into this feature myself, as there are some tangential changes I want to make. I'll let you know if that changes. Thanks, @cy-zheng
Hi @cy-zheng! Sorry for the delay on this. We had some internal discussions about this feature, and after some testing we decided that the best way to support this use case is for users to manually create a StatefulSet and a Service themselves. The example below shows how.
Keep in mind that for Argo to create these resources, its service account needs the appropriate RBAC permissions.
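For reference, the kind of grant involved might look roughly like the Role/RoleBinding sketch below. The namespace and service account name used here are assumptions and must match your installation:

```yaml
# Hypothetical Role letting the workflow's service account manage the
# Service and StatefulSet created by the resource templates below.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: daemon-resource-manager
  namespace: default            # assumption: the workflow's namespace
rules:
- apiGroups: [""]
  resources: ["services"]
  verbs: ["create", "get", "delete"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["create", "get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: daemon-resource-manager
  namespace: default
subjects:
- kind: ServiceAccount
  name: default                 # assumption: the service account the workflow runs as
  namespace: default
roleRef:
  kind: Role
  name: daemon-resource-manager
  apiGroup: rbac.authorization.k8s.io
```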
(The exact permissions required depend on where Argo is installed and what service account it's configured to use.) Here is the full Workflow. Let me know if you have any questions.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemoned-stateful-set-with-service-
spec:
  entrypoint: create-wait-and-test
  onExit: delete
  templates:
  - name: create-wait-and-test
    steps:
    - - name: create-service
        template: create-service
      - name: create-stateful-set
        template: create-stateful-set
    - - name: wait-stateful-set
        template: wait-stateful-set
    - - name: test
        template: test

  - name: delete # This is called as an onExit handler
    steps:
    - - name: delete-service
        template: delete-service
      - name: delete-stateful-set
        template: delete-stateful-set

  - name: create-service
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: nginx
          labels:
            app: nginx
        spec:
          ports:
          - port: 80
            name: web
          clusterIP: None
          selector:
            app: nginx

  - name: create-stateful-set
    resource:
      action: create
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web
        spec:
          selector:
            matchLabels:
              app: nginx # has to match .spec.template.metadata.labels
          serviceName: "nginx"
          template:
            metadata:
              labels:
                app: nginx # has to match .spec.selector.matchLabels
            spec:
              terminationGracePeriodSeconds: 10
              containers:
              - name: nginx
                image: k8s.gcr.io/nginx-slim:0.8
                ports:
                - containerPort: 80
                  name: web

  - name: wait-stateful-set
    resource:
      action: get
      successCondition: status.readyReplicas == 1
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web

  - name: test
    container:
      image: curlimages/curl:latest
      command: [sh, -c]
      args: ["curl nginx"]

  - name: delete-service
    resource:
      action: delete
      flags: ["--ignore-not-found"]
      manifest: |
        apiVersion: v1
        kind: Service
        metadata:
          name: nginx

  - name: delete-stateful-set
    resource:
      action: delete
      flags: ["--ignore-not-found"]
      manifest: |
        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: web
@cy-zheng Let me know if the solution works for you!
Cool. It does work for me. My only question is:
Hi @simster7, there are some drawbacks to the solution above.
To sum up, this solution is a good workaround for our problem, but there is more we could do to make Argo better. I think a good tool should hide these details from the user; an enhancement to daemon containers is probably the best choice.
Hi, I agree with all of @cy-zheng's points, and I want to add another one:
Can you please reopen this issue? Thanks.
Question: what happens if we have a workflow with a daemon step (daemon: true) keeping the workflow up in the background, and another, non-daemon step with retries that might crash?
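The scenario in the question above could be sketched roughly as follows; the template names, images, and retry limit are all illustrative, not from this thread:

```yaml
# Hypothetical workflow: one daemoned step, one retried non-daemon step.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: daemon-plus-retry-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: server
        template: server
    - - name: worker
        template: worker
        arguments:
          parameters:
          - name: host
            value: "{{steps.server.ip}}"  # daemon step's pod IP
  - name: server
    daemon: true                  # keeps running in the background
    container:
      image: nginx:stable
  - name: worker
    retryStrategy:
      limit: 3                    # illustrative retry limit
    inputs:
      parameters:
      - name: host
    container:
      image: curlimages/curl:latest
      command: [sh, -c]
      args: ["curl {{inputs.parameters.host}}"]
```

Note that the worker's retries reuse the pod IP captured from the daemon step, which is exactly where the stale-IP problem discussed above can bite if the daemon pod is replaced in the meantime.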
Checklist:
What happened:
When a worker node inside the k8s cluster crashed, a daemon container running on it couldn't restart.
The daemon container enters the "Error" state even though I have configured a retry strategy for it.
What you expected to happen:
Argo controller should recreate the daemon container pod.
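For reference, the retry strategy referred to here is the standard per-template field; a minimal sketch (names and values illustrative):

```yaml
# Illustrative daemon template carrying a retryStrategy.
- name: my-daemon            # hypothetical template name
  daemon: true
  retryStrategy:
    limit: 3                 # illustrative retry limit
  container:
    image: nginx:stable
```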
How to reproduce it (as minimally and precisely as possible):
Just use kubectl delete to delete one of the running daemon container's pods.
Anything else we need to know?:
In fact, I think we could design "daemon container" better.
One reason the daemon container cannot restart, I guess, is that its pod IP has already been passed to the following DAG nodes. I think a better solution is to bind a k8s Service object to the daemon container(s) and pass the Service's cluster IP to the following DAG nodes instead.
That could also enable load balancing between multiple daemon containers.
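Even with a Service in front of the daemon, clients would still see brief connection failures while a pod restarts, so some client-side retry is useful either way. A minimal sketch of that failover logic, with all names illustrative and not part of Argo:

```python
import time

def call_with_retry(fn, attempts=5, backoff=0.1):
    """Call fn(), retrying with exponential backoff on ConnectionError."""
    delay = backoff
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2

# Simulate a daemon endpoint that fails while its pod restarts,
# then recovers once the Service routes to the new pod.
state = {"calls": 0}

def flaky_endpoint():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("daemon pod restarting")
    return "ok"

result = call_with_retry(flaky_endpoint)
print(result)  # prints "ok" on the third attempt
```

In a real workflow step, `flaky_endpoint` would be an HTTP call to the Service's stable DNS name rather than to a captured pod IP.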
Environment:
Argo version:
v2.7.5
Kubernetes version:
v1.13.12
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.