Skip to content

Conversation

@isubasinghe
Copy link
Member

@isubasinghe isubasinghe commented Oct 23, 2024

Fixes #12993 and Fixes #13533

Motivation

The previous fix relied upon a Message field. There is no guarantee that this Message is always given to us.
We now directly check if a pod exists.

Modifications

Check if pod exists.

Verification

Unable to verify with certainty due to being a rare edge case.

@isubasinghe
Copy link
Member Author

/retest

1 similar comment
@isubasinghe
Copy link
Member Author

/retest

@isubasinghe
Copy link
Member Author

isubasinghe commented Oct 24, 2024

Seems to be failing to pull argocli:latest since the imagePullPolicy is Never.
I guess this image doesn't exist on the cluster as well.

@isubasinghe isubasinghe force-pushed the ignore-taskresults-when-failed branch from 8c06c57 to 3cb3151 Compare October 25, 2024 00:57
@isubasinghe isubasinghe force-pushed the ignore-taskresults-when-failed branch from 3cb3151 to 49ff7a4 Compare October 25, 2024 00:59
@isubasinghe
Copy link
Member Author

isubasinghe commented Oct 25, 2024

This won't quite work. The previous comment ^ is incorrect. The code fails tests due to a race condition.
Turns out we can use a mounted volume as side channel effectively.

Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe
Copy link
Member Author

Made a change that checks if the pod exists instead of the node Message.

@isubasinghe isubasinghe reopened this Oct 25, 2024
Copy link
Member

@jswxstw jswxstw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

By the way, I think IsPodDeleted can be removed since it is useless now.

Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe
Copy link
Member Author

/retest

agilgur5
agilgur5 previously approved these changes Oct 25, 2024
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense: if the Task Result exists, that means the Pod at some point existed, and if it no longer exists, then that means it was deleted.

The main race I can think of here is if the Task Result was seen in the Informer before the Pod was. Wasn't this effectively the purpose of the POD_ABSENT_TIMEOUT from #13454?

@Joibel
Copy link
Member

Joibel commented Oct 25, 2024

The main race I can think of here is if the Task Result was seen in the Informer before the Pod was. Wasn't this effectively the purpose of the POD_ABSENT_TIMEOUT from #13454?

I have a worry around this too. I'd like there to be some timeout between when we've noticed that a pod has disappeared so that a delayed but completed WorkflowTaskResult can arrive and be actioned.

As this PR is now I don't think we're guaranteed to see the completed task result before the pod removal event, and something like POD_ABSENT_TIMEOUT (or similar) would give us that window.

@agilgur5
Copy link

agilgur5 commented Oct 25, 2024

As this PR is now I don't think we're guaranteed to see the completed task result before the pod removal event

Ah I actually said the inverse race, which is less likely; this variant is more likely and possible too.

Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking on the basis that @agilgur5 and I both think this needs a time window. #13798 (comment)

@agilgur5
Copy link

I approved on the basis that this would catch more races than the code before this, but it creates a few too 😅
A more holistic fix would be great either way for sure

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe isubasinghe requested review from Joibel and jswxstw October 28, 2024 12:14
@isubasinghe isubasinghe merged commit eb23eb6 into argoproj:main Oct 28, 2024
isubasinghe added a commit that referenced this pull request Oct 30, 2024
#13533 (#13798)

Signed-off-by: isubasinghe <isitha@pipekit.io>
isubasinghe added a commit that referenced this pull request Oct 30, 2024
#13533 (#13798)

Signed-off-by: isubasinghe <isitha@pipekit.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/controller Controller issues, panics

Projects

None yet

4 participants