Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report zone via well-known topology key in NodeGetInfo #1931

Merged
merged 1 commit into from
Feb 15, 2024

Conversation

ConnorJC3
Copy link
Contributor

@ConnorJC3 ConnorJC3 commented Feb 13, 2024

Is this a bug fix or adding new feature?

Both?

What is this PR about? / Why do we need it?

See #729 (comment)

What testing is done?

  • Manually tested locally by attaching volumes created by both this and a previous version of the driver, and ensuring they both attached
  • e2e (see CI)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 13, 2024
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 13, 2024
Copy link

Code Coverage Diff

This PR does not change the code coverage

@ConnorJC3 ConnorJC3 force-pushed the update-node-topology branch from f59568d to b32f203 Compare February 13, 2024 17:50
Copy link
Contributor

@AndrewSirenko AndrewSirenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting to resolve this paper-cut.

Had a small nit about renaming WellKnownTopologyKey but realized it is CO-agnostic and has existed for 3 years already.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2024
Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, so I took some time to review the history here and my current understanding is that this PR is not necessary - ill explain.

First, ill start by laying out some context that both the author of this PR and reviewers already know, but will be helpful to future maintainers and external contributors to follow this discussion.

There are 2 places where topology labels come into the picture:

  • Controller service - creates PVs with a topology label.
  • Node service - adds a topology label on the node.

Kubernetes or a CO will then use that topology information to ensure that a volume is accessible from a given node when scheduling workloads (wording directly from CSI spec).

That is to say, we need to ensure that dynamically provisioned PVs have a label that nodes can satisfy, if the label keys don't match the volumes will not be scheduled.


Back in circa 2021, maintainers of this project tried to switch to the well-known label (which was not so well-known at the time) and made a mistake in adding both the custom label AND the well-known label to provisioned PVs in the controller side. An important detail here as pointed out by Michelle is that the node selector matchExpression is ANDed, which I understand to mean that in order to schedule those volumes, both labels (the custom, and the well-known) must exist on the node.

Now the reason why this PR is not necessary: CCM already adds the well-known label to all nodes -- it does not make sense for the driver to override that. I have also manually tested and confirmed that to be true.

In short, we can rely on CCM / Kubelet to add the well-known label, and the important bit here over on the Node is that we must continue adding the custom label, otherwise PVs may not be scheduled on nodes which do not have the custom label and ebs-plugin node is the only component adding that label on nodes.


Given the above, I believe we can safely proceed to switching the label over on the controller to the well-known label if on Kubernetes. To illustrate:

Node Topology:

  • Custom label: topology.ebs.csi.aws.com/zone-- applied by CSI driver
  • Well-known label: topology.kubernetes.io/zone -- applied by CCM / Kubelet

PV Topology:

topology.ebs.csi.aws.com/zone -- applied by CSI driver

On PVs created by newer driver:

topology.kubernetes.io/zone -- applied by CSI driver if on Kubernetes
topology.ebs.csi.aws.com/zone -- applied by CSI driver if not on Kubernetes

That should fix all the issues and enable scaling from 0, which is explained in detail here. Please let me know if I missed anything and especially any compatibility concerns.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2024
@ConnorJC3
Copy link
Contributor Author

Hi @torredil - unfortunately, that will not work, as the AWS CCM is not a requirement to run the EBS CSI Driver. Thus, while your change works with clusters with the AWS CCM installed, it doesn't work for clusters without the AWS CCM installed.

Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding the following unit test to node_test.go

if at.Segments[WellKnownTopologyKey] != tc.availabilityZone {
	t.Fatalf("Expected well-known topology %q, got %q", tc.availabilityZone, at.Segments[WellKnownTopologyKey])
}

/lgtm
/approve
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 14, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: torredil

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2024
Signed-off-by: Connor Catlett <conncatl@amazon.com>
@ConnorJC3 ConnorJC3 force-pushed the update-node-topology branch from b32f203 to ae753df Compare February 14, 2024 17:06
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2024
@ConnorJC3
Copy link
Contributor Author

Suggest adding the following unit test to node_test.go

Missed that, thanks! Ready for re-review

@torredil
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2024
@torredil
Copy link
Member

/retest

4 similar comments
@ConnorJC3
Copy link
Contributor Author

/retest

@ConnorJC3
Copy link
Contributor Author

/retest

@ConnorJC3
Copy link
Contributor Author

/retest

@ConnorJC3
Copy link
Contributor Author

/retest

@ConnorJC3
Copy link
Contributor Author

/retest

1 similar comment
@ConnorJC3
Copy link
Contributor Author

/retest

@ConnorJC3
Copy link
Contributor Author

/retest

1 similar comment
@ConnorJC3
Copy link
Contributor Author

/retest

@torredil
Copy link
Member

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 15, 2024
@k8s-ci-robot k8s-ci-robot merged commit 289df4e into kubernetes-sigs:master Feb 15, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants