-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Artifacts not set in Status.Nodes | ArtifactGC prevents Workflow deletion #11879
Comments
As promised, here's the new ticket dedicated to my issue @juliev0 |
Okay, ultimately the root cause is outside of the ArtifactGC logic itself, but nonetheless I can try to look at this when I get time. |
@juliev0 much appreciated. |
@juliev0 I took a peek at the source code. I have a hypothesis that I'm hoping you can shine some light on. Perhaps it can expedite your investigation when you get around to it. My best guess is that the code execution never gets to this line here in I think the execution never reaches the above due, perhaps, to this line here in The solution may be to set outputs in The same issue seems to exist in Thanks for the help! |
That's good investigation. I don't see any reason why Outputs shouldn't be set in that case. @agilgur5 do you see any reason? @Garett-MacGowan would you be able to try seeing if you can fix this? I don't have much bandwidth and really haven't worked myself with this particular part of the code. |
I can give it a go. Will have to set up the dev environment. Any guidance would be helpful @agilgur5. |
Great, thank you. Have you seen this page?: https://argoproj.github.io/argo-workflows/running-locally/ if you have questions let us know. |
Ah well ironically I'm not too familiar with ArtifactGC but if this is actually in steps or DAG, that would be a part of the codebase that I'm familiar with. Guess we're switching off 😅 I can take a look tomorrow-ish. From a quick glance, @Garett-MacGowan's logic makes sense to me. It is an edge case if a step completes after stop, so it might've been missed. IIRC there were some recent changes to make sure artifacts are uploaded if a step completes after stop, so maybe outputs were missed in that change
Devcontainer works pretty well for a fast set-up nowadays. I use it myself as do some other contributors |
@juliev0 I've seen that page, thanks. I will work through dev setup tomorrow. @agilgur5 Thanks! Let me know if your look around results in anything I should know. If you see no problems with it, I'll implement a change so that |
WIP updateFindingsAs I work through this, I'm noticing the issue seems to be related only to stopping the workflow (a failure alone doesn't do the trick). I've got a test case that reproduces the error, so now I'm implementing the fix. Since there's no test cases for a step failure with GC, I'll implement one for that too, to be sure. My contribution experience + blocksJetbrains Dev Container A note in the setup docs to avoid using Dev Container argoexec-image failure loop Once I finally got the Dev Container working on VS Code, I still ran into an issue where the container would boot up but require a restart of VS Code to enter the environment. It would otherwise enter a loop where one step would continually fail.
Once I restart VS Code, it seems to work fine, but I'm not certain that this error has no consequences... Multiple terminals to segregate UI and Controller I don't think using one terminal for the controller + server and another for the UI works.... I'm not seeing any reference to CTRL in the No debug logs for workflow controller It seems that there is no loglevel=debug for the workflow controller as my Dev Container debug logs (enabled by 4+enter) don't show the workflow debug logs. When I add args:
- --loglevel=debug to apiVersion: apps/v1
kind: Deployment
metadata:
name: workflow-controller
spec:
replicas: 0
template:
spec:
containers:
- name: workflow-controller
imagePullPolicy: Never
args:
- --loglevel=debug |
Update for the Dev Container argoexec-image failure loop I'm working on Windows. It turns out the signal_linux.go is a symlink. This may be my issue. I'll try and resolve and update. |
Thanks for your efforts, analysis, & feedback @Garett-MacGowan! ❤️
Yea devcontainers are still pretty new, though I am surprised that Jetbrains doesn't just immediately throw an error or warning if it detects Would you like to submit a PR to add a Nowadays I just use the devcontainers CLI so that it doesn't affect my editor. I'm pretty used to developing with Docker/Compose CLI and headless Vagrant, so that was pretty familiar to me.
I have seen it occasionally fail on random steps, and indeed restarting does the trick. Not really clear why it's failing unfortunately 😕
Oh good find! I've never tried that before. I think this section probably pre-dates the usage of |
Ok, to fix the Dev Container argoexec-image failure loop, I had to enable symlinks in git. # To see the current config
git config --local --list
# To enable symlinks
git config --local core.symlinks true |
Sorry. This is kind of hacky but what I did for this was change this line locally to:
There's a bug in this project that it's not using the (I know this is not well publicized.) Have you run into this as well @agilgur5 ? I imagine it must be frustrating a lot of people. |
argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Here's my WIP branch for reference: https://github.com/GarettSoftware/argo-workflows/tree/fix/artifactgc_stopped_workflow |
Quick update: it turns out that outputs aren't present in
|
A couple more things I don't understand @agilgur5 || @juliev0 Wait container intermittent errors
How do these things connect? |
I've confirmed that result.Outputs is properly defined here. I don't think the issue is related to the wait container.
|
This specific log level issue, maybe? tbh, I have been super confused by the logs from It's been mentioned a few times to replace |
How do you mean "muxing"? The individual logs are in the
Yes, I would agree. It doesn't seem like he's working on it anyway, and now that he's not a lead he may not even object. |
As in, the output of running Tailing the individual file is perhaps a good workaround -- I hadn't thought of doing that! Managed to get by with limited log usage so far 😅 |
…ias for test timeouts in slow environments. Adjust WaitForWorkflow timeout parameter in retry_test.go (added 30 seconds) to be in line with the previously defined timeout before fixing the WaitForWorkflow custom timeout logic in commit 4ffeb5f. Fixes argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…rgoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…orkflow deletions during testing (in case tests are running in parallel). Fixes argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…tifactGC. Fixes argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…tifactGC. Fixes argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…. Towards argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…desired effect. Towards argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…mplify LabelKeyReportOutputsCompleted label setting logic. Add support for legacy/insecure pod patch for ReportOutputsCompleted annotation/label. Towards argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…it container defer methods to ensure they complete. Fixes argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…owards argoproj#11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…deletion of task results. Change owner reference for task result to Workflow. Add artifactgc test for stopped workflow with pod gc on pod completion. Fixes argoproj#11879 Signed-off-by: Garett MacGowan <garett.macgowan@gmail.com>
…1879 (argoproj#11947)" This reverts commit c296cf2.
Pre-requisites
:latest
What happened/what you expected to happen?
This issue stems from #10840. I expect to be able to delete a workflow and have the garbage collection remove the artifact from the repository.
In this specific case, the workflow finalizer isn't removed from the workflow due to a failure here, implying that the artifacts aren't being written to Status.Nodes. Upon inspecting the workflow manifest's
Status.Nodes
, I can confirm that no output artifacts are listed. Artifacts are successfully being written to GCS upon triggering "stop" for the workflow.Steps to Reproduce
Partial Remedy
While preparing this ticket, I notice that by adding at least one successful workflow and artifact to my DAG (in parallel), the workflow's finalizer is successfully deleted & the successful DAG step's artifacts are also successfully cleaned up, however, the failed (stopped) DAG step's artifacts are not cleaned up. This can lead to dangling resources in GCS.
Suggested Fix
Ensure that stopped/failed workflow step's output artifacts are written to
Status.Nodes
Version
3.5.0-rc1
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container
The text was updated successfully, but these errors were encountered: