
feat: store podname in nodestatus #12503

Draft · wants to merge 3 commits into base: main

Conversation

@isubasinghe (Member) commented Jan 13, 2024

Fixes #12528

Motivation

Modifications

The changes are relatively simple: I reduced the pod name generation code as much as possible and stored the PodName in the node itself.
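As a rough illustration of the shape of this change (these are not the actual Argo Workflows type definitions, just a minimal sketch), the stored field could be a nullable pointer so it is omitted from the serialized status until the pod actually exists:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeStatus is a simplified stand-in for the real Argo Workflows type.
// PodName is a pointer so that nil means "pod not created yet" and the
// field is dropped entirely from the JSON via omitempty.
type NodeStatus struct {
	ID      string  `json:"id"`
	Name    string  `json:"name"`
	PodName *string `json:"podName,omitempty"`
}

func marshalNode(n NodeStatus) string {
	b, _ := json.Marshal(n)
	return string(b)
}

func main() {
	pod := "my-wf-main-1234567890"
	// Node whose pod has been created: podName is present in status.
	fmt.Println(marshalNode(NodeStatus{ID: "1", Name: "my-wf.main", PodName: &pod}))
	// Node whose pod does not exist yet: podName is absent.
	fmt.Println(marshalNode(NodeStatus{ID: "2", Name: "my-wf.hook"}))
}
```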

Verification

Verified via testing. Note that this change slightly alters existing behaviour: a podName is only assigned after the pod has been created.

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: Isitha Subasinghe <isitha@pipekit.io>
@isubasinghe isubasinghe marked this pull request as ready for review January 14, 2024 08:41
@sarabala1979 (Member)

@isubasinghe why do we need the pod name in node status? Can you provide more details about the use case/scenario?
This will increase the workflow object size, which will reduce the number of steps a workflow can hold.
If the use case is to find the pod for a particular node, the node name can be added as a label on the pod object.
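The label-based alternative suggested here could be sketched roughly as follows. The label key `workflows.argoproj.io/node-name` and the in-memory filtering are illustrative assumptions, not necessarily what Argo would actually use; in a real controller this would be a label-selector query against the API server:

```go
package main

import "fmt"

// pod is a minimal stand-in for a Kubernetes pod's metadata.
type pod struct {
	name   string
	labels map[string]string
}

// nodeNameLabel is a hypothetical label key for this sketch.
const nodeNameLabel = "workflows.argoproj.io/node-name"

// podForNode mimics a label-selector lookup: find the pod labeled
// with the given workflow node name.
func podForNode(pods []pod, nodeName string) (string, bool) {
	for _, p := range pods {
		if p.labels[nodeNameLabel] == nodeName {
			return p.name, true
		}
	}
	return "", false
}

func main() {
	pods := []pod{
		{name: "my-wf-main-42", labels: map[string]string{nodeNameLabel: "my-wf.main"}},
		{name: "my-wf-hook-7", labels: map[string]string{nodeNameLabel: "my-wf.hook"}},
	}
	if name, ok := podForNode(pods, "my-wf.main"); ok {
		fmt.Println(name) // my-wf-main-42
	}
}
```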

@sarabala1979 sarabala1979 self-assigned this Jan 16, 2024
@isubasinghe (Member, Author)

@sarabala1979 the issue isn't created yet, but this PR addresses this comment by Alex, where we decided to store the podName in the Node status: #10267 (comment)

Comment on lines +496 to +501
if node.PodName != nil {
    podName = *node.PodName
} else {
    expectedPodName := util.GeneratePodName(wfName, node.Name, templateName, node.ID, podNameVersion)
    panic("podName absent expected " + expectedPodName + " for " + node.Name)
}
@isubasinghe (Member, Author)

This is probably not the desirable behaviour, there might be cases where the pod has not been created yet.
What should we do in this case?
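One possible answer to this question, sketched below, is to fall back to deriving the name when it is absent instead of panicking. Here `generatePodName` is a simplified stand-in for `util.GeneratePodName`, not the real implementation:

```go
package main

import "fmt"

// generatePodName is a simplified stand-in for util.GeneratePodName.
func generatePodName(wfName, nodeName string) string {
	return fmt.Sprintf("%s-%s", wfName, nodeName)
}

// resolvePodName prefers the stored name, and falls back to deriving it
// when the pod has not been created yet (or the Workflow predates the
// stored field), rather than panicking.
func resolvePodName(stored *string, wfName, nodeName string) string {
	if stored != nil {
		return *stored // status is the source of truth once the pod exists
	}
	return generatePodName(wfName, nodeName)
}

func main() {
	stored := "my-wf-main-42"
	fmt.Println(resolvePodName(&stored, "my-wf", "main")) // my-wf-main-42
	fmt.Println(resolvePodName(nil, "my-wf", "main"))     // my-wf-main
}
```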

@agilgur5 (Member) left a comment

I only took a quick glance at this, but doesn't this need to modify the UI code as well?

Also, for backward-compatibility, we'd still have to generate the pod names for Workflows created before this change.

@terrytangyuan (Member) left a comment

We'll need to test this thoroughly this time.

@isubasinghe (Member, Author)

I only took a quick glance at this, but doesn't this need to modify the UI code as well?

I thought I had replied to this; I must not have pressed the comment button. This was on purpose: I wanted to make the backend changes first and follow up with the frontend changes, but I'm happy to make the UI changes here as well if you'd like me to.

Also, for backward-compatibility, we'd still have to generate the pod names for Workflows created before this change.

Hmmm, yeah, you are right. What I might do is also keep the previous pod name generation code.
We can then also use that to test whether these changes kept the original behaviour.

@HumairAK

Bumping this, @isubasinghe do you think you can rebase and resolve the conflicts?

@terrytangyuan this is something we use in Kubeflow Pipelines to retrieve pod names (we are currently planning to use a workaround you prescribed here). We are very much interested in seeing this merged so that we can rely on Argo to provide this value. What can we do to get this some more traction?

@terrytangyuan (Member)

We can discuss this in the upcoming contributors meeting to prioritize.

@agilgur5 (Member) commented Apr 15, 2024

@HumairAK this is currently a feature, so it wouldn't land until the next minor (currently 3.6). If KFP is only bumping to 3.4, this PR wouldn't solve your problem

@HumairAK

@agilgur5 that is fine; hence why we plan to go with the workaround for now, until we can upgrade to a version that (hopefully) has this change included, at which point we'll switch to relying on Argo for this.

@agilgur5 (Member) left a comment

So this certainly simplifies things moving forward (not during a backward-compat window though), but in recent months I had found older comments from Alex C about trying to store as little as possible in the status due to our existing issues with large Workflows (nodeStatusOffload + general memory usage), such as: #9193 (comment):

Please don't add new fields to NodeStatus, as each new field needs O(n) extra storage, where n is the number of nodes. It is better to traverse status and build a new structure if needed. A one-time traverse would be O(n).

I'm wondering if that was perhaps the original rationale behind some of the derivations in the codebase -- derive the field instead of storing it in status.

Effectively, this is a trade-off between storage and determinism. It actually reminds me quite heavily of the evolution of lockfiles in the JS ecosystem.
There though, determinism is simple to favor over storage, as hard disk/FS storage is cheap and plentiful (although VCS storage and diffs are a bit more complicated), compared to in-memory or etcd storage, where we have a hard limit
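The derive-on-demand alternative quoted above, a one-time traversal building a new structure instead of storing the field on every node, might look roughly like this sketch (node shape and naming helper are simplified stand-ins, not the real Argo types):

```go
package main

import "fmt"

// node is a minimal stand-in for a workflow node in status.
type node struct {
	ID, Name string
}

// derivePodName is a simplified stand-in for util.GeneratePodName.
func derivePodName(wfName string, n node) string {
	return fmt.Sprintf("%s-%s", wfName, n.Name)
}

// podNamesByNode builds a nodeID -> podName map in a single O(n) pass,
// so callers that need many lookups pay nothing extra in stored status.
func podNamesByNode(wfName string, nodes []node) map[string]string {
	m := make(map[string]string, len(nodes))
	for _, n := range nodes {
		m[n.ID] = derivePodName(wfName, n)
	}
	return m
}

func main() {
	nodes := []node{{ID: "a", Name: "main"}, {ID: "b", Name: "hook"}}
	fmt.Println(podNamesByNode("my-wf", nodes))
}
```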

@agilgur5 (Member)

@HumairAK ok. That does affect our prioritization though, as it means KFP won't be using this functionality in the short term or likely the near term.

@agilgur5 (Member) commented Apr 16, 2024

Effectively, this is a trade-off between storage and determinism.

To be clear, I don't have a strong opinion on that per se; but I think we should be explicit about that -- larger workflows would be affected by this feature.

I probably lean toward agreeing with this feature, as it would simplify areas of the codebase where we do have some bugs / edge-cases currently that are handled in different ways in different places.

(not during a backward-compat window though)

The deprecation window is actually a bit complicated now that I think about it. If we ever want to remove the pod name generation code, we would break backward-compat with Workflows that were created in a version that doesn't contain the pod name in the status.

For instance, say 3.6 puts pod names into the status by default and has backward-compat to derive pod names when not present in the status. That backward-compat means the POD_NAMES env var still has to be set.

Then say, in 3.7 we were to remove the backward-compat -- all Workflows created with Argo <3.6 would no longer be processable, e.g. for retries or Pod logs in the UI and possibly other features.
We may perhaps want to keep this backward-compat around for more than 1 minor version as such (although I think Emissary only had 1 minor version of backward-compat as well: default in 3.3, only option in 3.4 -- so there may be precedent to only do 1 minor of backward-compat).

With that in mind, the pod name derivation code is still going to have to be around for a while. Defaulting to this feature means that we could eliminate bugs for newer Workflows, but they might still be present in older Workflows. So the frequency of the old bugs would decrease, but we may still want to fix them.
Or in other words, the backward-compat requirement decreases some of the benefit of this feature.

@terrytangyuan (Member) left a comment

One concern is that this will increase the object size.

@agilgur5 (Member)

Discussed in today's Contributor Meeting. The storage concern that I brought up earlier was still relevant, especially as we've had more users recently reporting etcd getting full (particularly on managed k8s providers where you cannot configure your etcd space), such as #12802.

But given that I brought it up and I personally think the trade-off is worthwhile in this case -- determinism over storage and treat storage as a separate problem to optimize independent of any one field -- this has the green light to go.

We do need to implement the backward compat in this PR as well as decide whether pod names will be on all nodes or just pod nodes.
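The "pod nodes only" option mentioned here could be sketched as below. The node-type names mirror Argo's node types, but the helper and its placement are illustrative assumptions, not the actual implementation:

```go
package main

import "fmt"

type nodeType string

// Node types mirroring Argo Workflows' node kinds; only Pod nodes
// correspond to an actual Kubernetes pod.
const (
	nodeTypePod       nodeType = "Pod"
	nodeTypeSteps     nodeType = "Steps"
	nodeTypeDAG       nodeType = "DAG"
	nodeTypeStepGroup nodeType = "StepGroup"
)

// assignPodName returns a podName only for pod nodes; group nodes
// (Steps, DAG, StepGroup) never own a pod, so their field stays nil
// and is omitted from status.
func assignPodName(t nodeType, name string) *string {
	if t != nodeTypePod {
		return nil
	}
	return &name
}

func main() {
	if p := assignPodName(nodeTypePod, "my-wf-main-42"); p != nil {
		fmt.Println("pod node:", *p) // pod node: my-wf-main-42
	}
	fmt.Println("dag node has name:", assignPodName(nodeTypeDAG, "my-wf") != nil)
}
```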

@juliev0 juliev0 added the problem/more information needed Not enough information has been provide to diagnose this issue. label Jul 8, 2024
github-actions bot (Contributor)

This PR has been automatically marked as stale because it has not had recent activity and needs further changes. It will be closed if no further activity occurs.

@github-actions github-actions bot added the problem/stale This has not had a response in some time label Jul 23, 2024
github-actions bot (Contributor) commented Aug 6, 2024

This PR has been closed due to inactivity and lack of changes. If you would like to still work on this PR, please address the review comments and re-open.

@github-actions github-actions bot closed this Aug 6, 2024
@HumairAK commented Aug 6, 2024

this has the green light to go.

hmm would still be interested in seeing this change

@isubasinghe do you have the bandwidth to continue work on this?

@isubasinghe (Member, Author)

@HumairAK will continue on this when I get the chance

@isubasinghe isubasinghe reopened this Aug 9, 2024
@isubasinghe isubasinghe marked this pull request as draft August 9, 2024 05:08
@github-actions github-actions bot removed problem/stale This has not had a response in some time problem/more information needed Not enough information has been provide to diagnose this issue. labels Aug 10, 2024
Labels
area/controller Controller issues, panics

Successfully merging this pull request may close these issues.

Store podname in Node status to ensure single source of truth for PodName
6 participants