fix: fix missing artifacts for stopped workflows. Fixes #12401 #12402
Conversation
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…tputs in status is now considered. Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Thanks for being on top of any new regressions that may have been caused by your PR! I glanced the other day, but then it sounded like you were saying that it was a UI issue. From going through that Issue now, it looks like this is not a direct fix to that issue, right? In that case, do you want to remove "Fixes" from your description and instead say something like "Related to"? Otherwise, this will automatically close that Issue.

These are two separate issues! #12331 must have been caused by another PR. I fixed that issue in #12397. This PR fixes #12401, which occurs when you stop a workflow. My previous PR (#11947) didn't guarantee that artifacts would get archived for stopped workflows, although it fixed the garbage collection.

Oh, okay!
workflow/controller/operator.go
Outdated
```diff
@@ -802,7 +802,7 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {
 func (woc *wfOperationCtx) checkTaskResultsCompleted() bool {
 	taskResultsCompleted := woc.wf.Status.GetTaskResultsCompleted()
-	if len(taskResultsCompleted) == 0 {
+	if len(taskResultsCompleted) == 0 && woc.wf.Status.Outputs != nil {
```
It does seem like we want to make sure there are actually Outputs.
But now it makes me wonder, when do the Outputs get set? Is it possible to arrive here when Outputs==nil
because they haven't been set yet?
@juliev0 Happy Holidays!

Outputs are first initialized and populated (at a node level) in `taskResultReconciliation()`. Outputs are subsequently initialized and populated to the global scope `wf.Status.Outputs` in `woc.podReconciliation()` via `woc.addOutputsToGlobalScope(newState.Outputs)`, assuming `Parameters` or `Artifacts` are defined for any node. Therefore, `Outputs == nil` when either `woc.taskResultReconciliation()` hasn't been run yet, or `Parameters` or `Artifacts` aren't defined for any node in the workflow.

As long as `checkTaskResultsCompleted()` has been run after `taskResultReconciliation()`, `&& woc.wf.Status.Outputs != nil` is meaningful.
`checkTaskResultsCompleted()` is used in two places:

- Before `woc.garbageCollectArtifacts(ctx)` (after `taskResultReconciliation()`):

```go
// Do artifact GC if all task results are completed.
if woc.checkTaskResultsCompleted() {
	if err := woc.garbageCollectArtifacts(ctx); err != nil {
		woc.log.WithError(err).Error("failed to GC artifacts")
		return
	}
} else {
	woc.log.Debug("Skipping artifact GC")
}
```
- Before `woc.deleteTaskResults(ctx)` as part of `persistUpdates(ctx)` (called in conjunction with `woc.wf.Status.Phase.Completed()`, which ensures that the workflow has either succeeded, failed, or errored, and therefore after `checkTaskResultsCompleted()` has had a chance to run):

```go
// Make sure the TaskResults are incorporated into WorkflowStatus before we delete them.
if woc.wf.Status.Phase.Completed() && woc.checkTaskResultsCompleted() {
	if err := woc.deleteTaskResults(ctx); err != nil {
		woc.log.WithError(err).Warn("failed to delete task-results")
	}
}
```
Happy holidays!

Trying to follow. So, based on your statement about how "Outputs are subsequently initialized and populated to the global scope `wf.Status.Outputs` in `woc.podReconciliation()`", doesn't that mean that `Outputs` could also be nil in the case that `taskResultReconciliation()` had been run but `podReconciliation()` had not?
Oops, I meant to say:

- Before `woc.deleteTaskResults(ctx)` as part of `persistUpdates(ctx)` (called in conjunction with `woc.wf.Status.Phase.Completed()`, which ensures that the workflow has either succeeded, failed, or errored, and therefore after `woc.podReconciliation()` has had a chance to run).

I missed the obvious (case 3), which I just introduced, inside `markWorkflowPhase`, which is called at the end of the operator loop under various conditions.

```go
switch phase {
case wfv1.WorkflowSucceeded, wfv1.WorkflowFailed, wfv1.WorkflowError:
	// wait for all daemon nodes to get terminated before marking workflow completed & make sure task results are complete as well.
	if phase.Completed() && !woc.hasDaemonNodes() && woc.checkTaskResultsCompleted() {
		woc.log.Info("Marking workflow completed")
```
Good catch, yes, you are right. The case you identified can happen for `woc.garbageCollectArtifacts(ctx)` (case 1), which gets called right after `woc.taskResultReconciliation()`, but before `woc.podReconciliation()`. I think a race condition could occur where the `taskResultInformer` hasn't received an update yet, and therefore garbage collection is allowed to happen because `wf.Status.Outputs == nil`.

Since `woc.wf.Status.Outputs != nil` is meant to check that outputs are in global scope (for archiving purposes), maybe it's best that we pull that logic up to `if phase.Completed() && !woc.hasDaemonNodes() && woc.checkTaskResultsCompleted() {`.

For example:

```go
switch phase {
case wfv1.WorkflowSucceeded, wfv1.WorkflowFailed, wfv1.WorkflowError:
	taskResultsCompleted := woc.checkTaskResultsCompleted()
	// wait for all daemon nodes to get terminated before marking workflow completed & make sure task results are complete as well.
	if phase.Completed() && !woc.hasDaemonNodes() && (taskResultsCompleted || (!taskResultsCompleted && woc.wf.Status.Outputs == nil)) {
		woc.log.Info("Marking workflow completed")
```
The above doesn't work. I'm thinking through it and will propose another solution.
@juliev0 I suggest the following changes. Notice the new `considerStatusOutputs bool` parameter for `checkTaskResultsCompleted`:

```go
func (woc *wfOperationCtx) checkTaskResultsCompleted(considerStatusOutputs bool) bool {
	taskResultsCompleted := woc.wf.Status.GetTaskResultsCompleted()
	if considerStatusOutputs && woc.wf.Status.Outputs != nil && len(taskResultsCompleted) == 0 {
		return false
	}
	if !considerStatusOutputs && len(taskResultsCompleted) == 0 {
		return false
	}
	for _, completed := range taskResultsCompleted {
		if !completed {
			return false
		}
	}
	return true
}
```

Case 1)

```go
// Do artifact GC if all task results are completed.
if woc.checkTaskResultsCompleted(false) {
	if err := woc.garbageCollectArtifacts(ctx); err != nil {
		woc.log.WithError(err).Error("failed to GC artifacts")
		return
	}
} else {
	woc.log.Debug("Skipping artifact GC")
}
```

Case 2)

```go
// Make sure the TaskResults are incorporated into WorkflowStatus before we delete them.
if woc.wf.Status.Phase.Completed() && woc.checkTaskResultsCompleted(true) {
	if err := woc.deleteTaskResults(ctx); err != nil {
		woc.log.WithError(err).Warn("failed to delete task-results")
	}
}
```

Case 3)

```go
switch phase {
case wfv1.WorkflowSucceeded, wfv1.WorkflowFailed, wfv1.WorkflowError:
	// wait for all daemon nodes to get terminated before marking workflow completed & make sure task results are complete as well.
	if phase.Completed() && !woc.hasDaemonNodes() && woc.checkTaskResultsCompleted(true) {
		woc.log.Info("Marking workflow completed")
```
Sorry, going back... let's start from "what was the original intention of this new check?" To handle which use case?
@juliev0 If we don't do this check, the operator will hang in marking the workflow as completed if the workflow has no nodes with outputs, because `woc.checkTaskResultsCompleted` will return false if `len(taskResultsCompleted) == 0`.
`len(taskResultsCompleted) == 0` can happen if there are no outputs for any nodes, so we can't always say that task results have not completed if `len(taskResultsCompleted) == 0`. If `wf.Status.Outputs` is defined and `len(taskResultsCompleted) == 0`, then we can say that task results have not completed.

If `wf.Status.Outputs` is defined (outputs or parameters do exist in at least one node), and `taskResultsCompleted` is an empty map, then we know that the wait containers aren't done executing.

The new usage of `woc.checkTaskResultsCompleted(true)` in `case wfv1.WorkflowSucceeded, wfv1.WorkflowFailed, wfv1.WorkflowError:` is to prevent the workflow from archiving before the status is updated with the outputs.
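As a sanity check on the semantics described above, here is a standalone sketch of the proposed `considerStatusOutputs` logic with the real Argo structs replaced by plain parameters (`outputsSet` stands in for `woc.wf.Status.Outputs != nil`); this is illustrative only, not the actual controller code:

```go
// Toy demonstration of the considerStatusOutputs semantics discussed above.
// All names and parameters here are simplified stand-ins, not the real Argo API.
package main

import "fmt"

func checkTaskResultsCompleted(considerStatusOutputs bool, outputsSet bool, taskResultsCompleted map[string]bool) bool {
	// Outputs exist somewhere in the workflow, but no task result has been
	// recorded yet: the wait containers can't be done.
	if considerStatusOutputs && outputsSet && len(taskResultsCompleted) == 0 {
		return false
	}
	if !considerStatusOutputs && len(taskResultsCompleted) == 0 {
		return false
	}
	for _, completed := range taskResultsCompleted {
		if !completed {
			return false
		}
	}
	return true
}

func main() {
	// Outputs set but no task result recorded yet: not complete.
	fmt.Println(checkTaskResultsCompleted(true, true, nil)) // false
	// No outputs anywhere in the workflow: an empty map is fine.
	fmt.Println(checkTaskResultsCompleted(true, false, nil)) // true
	// One result still pending: not complete.
	fmt.Println(checkTaskResultsCompleted(true, true, map[string]bool{"pod-a": false})) // false
}
```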
@juliev0 I just updated my last two comments since I did a really poor job in answering the first time.
workflow/controller/operator.go
Outdated
```diff
@@ -802,7 +802,7 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {
 func (woc *wfOperationCtx) checkTaskResultsCompleted() bool {
 	taskResultsCompleted := woc.wf.Status.GetTaskResultsCompleted()
-	if len(taskResultsCompleted) == 0 {
+	if len(taskResultsCompleted) == 0 && woc.wf.Status.Outputs != nil {
 		return false
 	}
```
Sorry, do you mind if we first go back to the original code and talk about the case you were trying to take care of with `if len(taskResultsCompleted) == 0 {`?
BTW, maybe we should rename `taskResultsCompleted` to something like `taskResultCompletionStatus`, since it sounds like we're talking about the number completed here, but we're not?
As far as I can tell, we're looking at the case in which no taskresults have ever even been initialized as incomplete, right?
Yeah, we could probably do with a rename.
Yes, we're looking at the case where no task results have been initialized as incomplete yet.
Is this the case we're concerned about, or is it more generally speaking the case in which a task result would be initialized in the future and hasn't been yet? As far as I can tell, it's once a Pod exists that a new task result will get added? So, wouldn't it be the same case if say this map has one entry and it's completed, but another task result simply hasn't been initialized yet?
💩 Yes, you're right. We should be handling that. I'm going to think on this a bit more & try to come up with something better.
I've flipped the way I'm looking at this and reformulated the check as `checkTaskResultsInProgress`. When a task result hasn't completed yet, it is added to a `TaskResultsInProgress` map. Once it is completed, it is removed from the map.

```go
func (woc *wfOperationCtx) checkTaskResultsInProgress() bool {
	taskResultsInProgress := woc.wf.Status.GetTaskResultsInProgress()
	if len(taskResultsInProgress) != 0 {
		return true
	}
	return false
}
```
In all three cases, `!woc.checkTaskResultsInProgress()` is called. All tests seem to be passing & the code is much easier to understand. I'm re-running the test script from my last PR to ensure the garbage collection race condition doesn't exist. #11947 (comment).

Assuming all the tests pass locally, I'll push and you can review @juliev0.
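To illustrate the in-progress bookkeeping described above, here is a toy sketch (the type and method names are hypothetical stand-ins, not the real Argo API): an entry is added when a task result is created and removed once it completes, so an empty map means nothing is in flight.

```go
// Hypothetical sketch of "task results in progress" bookkeeping, keyed by pod name.
// Not the actual Argo Workflows implementation.
package main

import "fmt"

type wfStatus struct {
	taskResultsInProgress map[string]bool
}

// markTaskResultInProgress records that a task result was created but its
// wait container has not finished reporting outputs yet.
func (s *wfStatus) markTaskResultInProgress(podName string) {
	if s.taskResultsInProgress == nil {
		s.taskResultsInProgress = map[string]bool{}
	}
	s.taskResultsInProgress[podName] = true
}

// markTaskResultCompleted removes the entry once reconciliation has
// incorporated the result into the workflow status.
func (s *wfStatus) markTaskResultCompleted(podName string) {
	delete(s.taskResultsInProgress, podName)
}

// checkTaskResultsInProgress mirrors the check in the PR: any entry at all
// means some task result is still being processed.
func (s *wfStatus) checkTaskResultsInProgress() bool {
	return len(s.taskResultsInProgress) != 0
}

func main() {
	s := &wfStatus{}
	s.markTaskResultInProgress("pod-a")
	fmt.Println(s.checkTaskResultsInProgress()) // true: pod-a still in flight
	s.markTaskResultCompleted("pod-a")
	fmt.Println(s.checkTaskResultsInProgress()) // false: map is empty again
}
```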
@juliev0 The case of "no task results have been initialized as incomplete yet" doesn't actually seem to be an issue based on my testing (with the new approach).
… 12401. Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
workflow/controller/operator.go
Outdated
```diff
@@ -240,7 +240,7 @@ func (woc *wfOperationCtx) operate(ctx context.Context) {
 	woc.taskResultReconciliation()

 	// Do artifact GC if all task results are completed.
-	if woc.checkTaskResultsCompleted() {
+	if !woc.checkTaskResultsInProgress() {
```
We don't need to check if `Phase.Completed()` for this case?
The artifact gc strategy protects us for this case. We could conceivably add this though. It would make things a bit more efficient.
plus, if we have common logic between this and the other calls, we might be able to have a common function, which would be nice
I'm actually kind of thinking about a common function which checks for the Workflow being completed, plus compares the number of pods that were run to the number of task results that got incorporated into the `WorkflowStatus.Outputs`, which could be called `Workflow.ReconciliationComplete()` or something like that. Does that make any sense?

This method could be called at any place in the code, and it should work.
I know in this case you were only concerned about which taskResults were saved at all, as opposed to which ones had actually gone through `podReconciliation()` (incorporated into `WorkflowStatus.Outputs` as well), so maybe the common function wouldn't be used here if it doesn't make sense.
(just edited my comment 2 comments back because it wasn't clear what i was talking about)
workflow/controller/operator.go
Outdated
```diff
-	// wait for all daemon nodes to get terminated before marking workflow completed
-	if markCompleted && !woc.hasDaemonNodes() {
+	// wait for all daemon nodes to get terminated before marking workflow completed & make sure task results are complete as well.
+	if phase.Completed() && !woc.hasDaemonNodes() && !woc.checkTaskResultsInProgress() {
```
We don't need `podReconciliation()` to have updated the Workflow Status for this one?
workflow/controller/operator.go
Outdated
```diff
@@ -788,7 +788,7 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {
 	}

 	// Make sure the TaskResults are incorporated into WorkflowStatus before we delete them.
-	if woc.wf.Status.Phase.Completed() && woc.checkTaskResultsCompleted() {
+	if woc.wf.Status.Phase.Completed() && !woc.checkTaskResultsInProgress() {
```
Do we also need to make sure that the podReconciliation has occurred to incorporate the taskResults into the Workflow's Status, or is that not needed based on where this is being called from?

Something I alluded to in my other comment, but I'm kind of wondering if there could be a risk in only checking whether there's a task result in progress - that requires that it got added to the map in the first place. It may be okay if it's only called from a particular place in the code, I suppose. But I was thinking: isn't the simplest thing to check if the Workflow is complete, plus confirm that all TaskResults have been fully processed (therefore equal to the number of Pods run), which could be a function called from anywhere? Just let me know if you disagree! I haven't looked that carefully.
What you are saying is probably the safest solution. Here's what I'm thinking now: As part of … Assuming the above issue is resolved, it should be okay to use something like:

```go
func (woc *wfOperationCtx) checkReconciliationComplete() bool {
	numCompletedNodes := len(woc.wf.Status.Nodes.Filter(func(x wfv1.NodeStatus) interface{} {
		return x.Phase.Completed()
	}))
	return woc.wf.Status.Phase.Completed() && numCompletedNodes == len(woc.wf.Status.GetTaskResultsCompleted())
}
```

Thoughts?
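A self-contained toy version of the `checkReconciliationComplete()` idea, with simplified stand-in types instead of the real `wfv1` structs, might look like this:

```go
// Toy model of the proposal above: reconciliation is considered complete when
// the workflow is done and the number of completed nodes matches the number
// of completed task results. Types are illustrative stand-ins only.
package main

import "fmt"

type nodeStatus struct{ completed bool }

func checkReconciliationComplete(workflowCompleted bool, nodes []nodeStatus, taskResultsCompleted map[string]bool) bool {
	numCompletedNodes := 0
	for _, n := range nodes {
		if n.completed {
			numCompletedNodes++
		}
	}
	numCompletedResults := 0
	for _, done := range taskResultsCompleted {
		if done {
			numCompletedResults++
		}
	}
	return workflowCompleted && numCompletedNodes == numCompletedResults
}

func main() {
	nodes := []nodeStatus{{true}, {true}}
	// Only one of two task results incorporated: not reconciled yet.
	fmt.Println(checkReconciliationComplete(true, nodes, map[string]bool{"a": true})) // false
	// Both results incorporated: reconciled.
	fmt.Println(checkReconciliationComplete(true, nodes, map[string]bool{"a": true, "b": true})) // true
}
```

As the follow-up comments note, this counting approach turned out to be unreliable when a workflow is terminated before its wait containers start, which is why an alternative was ultimately pushed.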
Thanks for all the flexibility on the approach! I'm on vacation now, so may not get a chance to look at your response until after 1/1, unfortunately. I'll get back to you then.

Have a good vacation! I will take a crack at what I've suggested above.
…j#12401 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
@juliev0 I had to make some tweaks to the implementation I suggested above, but I think I've got something more along the lines of what you were thinking.

Working through the test failures. They have to do with termination as far as I can tell.

Unfortunately, the number of pods run is not a reliable indicator of wait container execution. If the workflow is terminated as soon as it is created, the wait container does not start up. I'm going to push an alternative solution.
…robustness of reconciliation completion check. Add check for workflow completion to reconciliation completion check. Add call to addOutputsToGlobalScope to ensure that reconciled outputs are included in archived workflows. Fixes: argoproj#12401 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
…ent. Fixes argoproj#12401. Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
@juliev0 I feel like the tests fail 1/3 of the time because the argo exec image download keeps timing out. It's super annoying to have to keep submitting empty commits to re-run the CI. Is there anything in the works to address this?
Signed-off-by: Garett MacGowan <garettsoftware@gmail.com>
Sorry, I know the feeling. There was a discussion of potentially implementing automatic retries for failures in the CI, but somebody decided against it since it would hide actual failures, and they preferred to address failures. As for the argoexec image download timing out, I'm not sure why that's happening so frequently. Is it just today, or have you seen it in the past too?

@Garett-MacGowan Thanks for keeping an eye on the new incoming bugs. Now that I've merged this, it would be great if you could continue to keep an eye on anything that could be caused by this, just in case, such as related to output parameters/artifacts, or workflows not being cleaned up, etc.

Thanks for merging! Will do.

This one has been happening for at least the last few days. I can understand not wanting to automatically retry test failures, but setup failures (like image downloads) fall into a different category. Perhaps we can increase the timeout for that? Maybe it's happening when there is a particularly high load on the image server?

Very true. Definitely feel free to create a PR for that if you're motivated! (I for one am only spending a limited amount of time working on this repo these days.)
argoproj#12402) Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> Signed-off-by: Raymond Chow <rchow@atlassian.com>
Hi all,

We face a regression problem after upgrading to release … We have a global artifact which is used and exported in two different steps WITH different values.

In the previous release, we could get the current global artifact from the last step (e2e-test-output-testupdate). I found there is a NEW change here:

```go
// Add outputs to global scope here to ensure that they are reflected in archive.
woc.addOutputsToGlobalScope(newNode.Outputs)
```

Which will overwrite the global artifact after every task, and should be the reason our process failed. We can find the overwriting log here: …

I don't know if I found the correct code and don't know how to solve that. Can you please help check if my investigation is correct? I also posted a comment in the argo-workflow channel. Thanks!

Maybe also cc @Garett-MacGowan for help. Thanks!
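For illustration, the overwrite behavior being described — each `addOutputsToGlobalScope`-style call replaces any existing global artifact with the same `globalName`, so the most recently reconciled node wins — can be sketched as follows (names here are hypothetical, not the real Argo API):

```go
// Minimal sketch of "last writer wins" for global artifacts.
// Not the actual Argo Workflows implementation.
package main

import "fmt"

type artifact struct {
	globalName string
	value      string
}

// addToGlobalScope mimics the overwrite: each node's outputs replace any
// existing global artifact with the same globalName.
func addToGlobalScope(global map[string]string, outputs []artifact) {
	for _, a := range outputs {
		global[a.globalName] = a.value
	}
}

func main() {
	global := map[string]string{}
	// A later step exports the updated value...
	addToGlobalScope(global, []artifact{{"testUpdate", "change update!!!"}})
	// ...but re-reconciling an earlier node overwrites it again,
	// which matches the regression reported above.
	addToGlobalScope(global, []artifact{{"testUpdate", "test update"}})
	fmt.Println(global["testUpdate"]) // test update
}
```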
Maybe also cc @Garett-MacGowan for help. Thanks! |
@zhangtbj Do you have a workflow example to reproduce? I can see if I can fix it for the next release.

This makes it sound like you are running these steps in parallel. If that's the case, then I think you're getting expected behavior. Either way, send along a workflow that reproduces the issue and I will have a better idea of what you're trying to do.
Hi @Garett-MacGowan, thanks for your reply. We are not running these steps in parallel, but it is also not a simple workflow. Please check:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generation: 21
generateName: test-artifact-12402-
namespace: action-dev2
spec:
activeDeadlineSeconds: 28800
arguments: {}
entrypoint: test-root
serviceAccountName: test
templates:
- inputs: {}
metadata: {}
name: test-root
outputs: {}
steps:
- - arguments: {}
name: inject-input-artifact
template: inject-input-artifact
- - arguments:
artifacts:
- from: '{{workflow.outputs.artifacts.testInput}}'
name: testInput
- from: '{{workflow.outputs.artifacts.testUpdate}}'
name: testUpdate
name: start-entrypoint
template: main
- - arguments:
artifacts:
- from: '{{workflow.outputs.artifacts.testUpload}}'
name: testUpload
name: upload-output-type-instances-step
template: upload-output-type-instances
- - arguments:
artifacts:
- from: '{{workflow.outputs.artifacts.testUpdate}}'
name: testUpdate
name: upload-update-type-instances-step
template: upload-update-type-instances
- inputs: {}
metadata: {}
name: main
outputs: {}
steps:
- - arguments:
artifacts:
- from: '{{workflow.outputs.artifacts.testInput}}'
name: input
name: cp
template: cp
- - arguments:
artifacts:
- from: '{{workflow.outputs.artifacts.testUpdate}}'
name: input-parameters
name: render-update-ti
template: main-render-update-ti-template
- - arguments:
artifacts:
- from: '{{steps.cp.outputs.artifacts.upload}}'
name: testUpload
name: output-testupload
template: output-testupload
- - arguments:
artifacts:
- from: '{{steps.render-update-ti.outputs.artifacts.render}}'
name: testUpdate
name: output-testupdate
template: output-testupdate
- container:
image: alpine:3.7
command: [sh, -c]
args: ["sleep 1; echo -n test input > /testInput.yaml; echo -n test update > /testUpdate.yaml"]
name: inject-input-artifact
outputs:
artifacts:
- globalName: testInput
name: testInput
path: /testInput.yaml
- globalName: testUpdate
name: testUpdate
path: /testUpdate.yaml
- container:
command: [sh, -c]
args: ["sleep 1; cp /input /upload"]
image: alpine:3.7
name: ""
resources: {}
inputs:
artifacts:
- name: input
path: /input
metadata: {}
name: cp
outputs:
artifacts:
- name: upload
path: /upload
- container:
image: alpine:3.7
command: [sh, -c]
args: ["sleep 1; echo -n change update!!! > /render.yml"]
metadata: {}
name: main-render-update-ti-template
outputs:
artifacts:
- name: render
path: /render.yml
- container:
args:
- sleep 1
command:
- sh
- -c
image: alpine:3.18.4
name: ""
resources: {}
inputs:
artifacts:
- name: testUpload
path: /typeinstance
metadata: {}
name: output-testupload
outputs:
artifacts:
- globalName: testUpload
name: testUpload
path: /typeinstance
- container:
args:
- sleep 1
command:
- sh
- -c
image: alpine:3.18.4
name: ""
resources: {}
inputs:
artifacts:
- name: testUpdate
path: /typeinstance
metadata: {}
name: output-testupdate
outputs:
artifacts:
- globalName: testUpdate
name: testUpdate
path: /typeinstance
- container:
command: [sh, -c]
args: ["sleep 1; cat /upload/typeInstances/testUpload; cat /upload/typeInstances/testUpload > /upload/result.yaml"]
image: alpine:3.18.4
name: ""
resources: {}
inputs:
artifacts:
- name: testUpload
path: /upload/typeInstances/testUpload
metadata: {}
name: upload-output-type-instances
outputs:
artifacts:
- globalName: uploadresult
name: uploadresult
path: /upload/result.yaml
- container:
command: [sh, -c]
args: ["sleep 1; cat /upload/typeInstances/testUpdate"]
image: alpine:3.18.4
name: ""
resources: {}
inputs:
artifacts:
- name: testUpdate
path: /upload/typeInstances/testUpdate
metadata: {}
name: upload-update-type-instances
  outputs: {}

If you run this workflow several times, you will find the final global output artifact will be changed to:

But actually, it should use the artifact from the later step, like:

Please let me know if anything is not clear. Thanks!
@zhangtbj I have a hypothesis for what is wrong. I'm working through how I can properly test it. Can you open an issue with the information above? I will submit a fix linked to that issue. Please mention this issue in the new issue. Thanks!

Hi @Garett-MacGowan, sure thing. I opened a new task: #12554 and linked this issue in the new issue. Thanks!
Fixes #12401
Motivation
Stopped workflows would not have artifacts available in the UI, which can be particularly annoying when debugging a workflow.
Modifications
Ensure workflow is archived only once all task results are finished processing.
Verification
Added an Argo server test to ensure that the artifact is available. Manually verified that artifacts are now present in the UI.