Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add InstanceExistsByProviderID to cloud provider interface for CCM #51087

Conversation

prydie
Copy link
Contributor

@prydie prydie commented Aug 22, 2017

What this PR does / why we need it:

Currently, MonitorNode() in the node controller checks with the CCM if a node still exists by calling ExternalID(nodeName). ExternalID is supposed to return the provider id of a node which is not supported on every cloud. This means that any clouds who cannot infer the provider id by the node name from a remote location will never remove nodes that no longer exist.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #50985

Special notes for your reviewer:

We'll want to create a subsequent issue to track the implementation of these two new methods in the cloud providers.

Release note:

Adds `InstanceExists` and `InstanceExistsByProviderID` to cloud provider interface for the cloud controller manager

/cc @wlan0 @thockin @andrewsykim @luxas @jhorwit2

/area cloudprovider
/sig cluster-lifecycle

@k8s-ci-robot k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Aug 22, 2017
@k8s-ci-robot
Copy link
Contributor

@prydie: GitHub didn't allow me to request PR reviews from the following users: jhorwit2.

Note that only kubernetes members can review this PR, and authors cannot review their own PRs.

In response to this:

What this PR does / why we need it:

Currently, MonitorNode() in the node controller checks with the CCM if a node still exists by calling ExternalID(nodeName). ExternalID is supposed to return the provider id of a node which is not supported on every cloud. This means that any clouds who cannot infer the provider id by the node name from a remote location will never remove nodes that no longer exist.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #50985

Special notes for your reviewer:

We'll want to create a subsequent issue to track the implementation of these two new methods in the cloud providers.

Release note:

Adds `InstanceExists` and `InstanceExistsByProviderID` to cloud provider interface for the cloud controller manager

/cc @wlan0 @thockin @andrewsykim @luxas @jhorwit2

/area cloudprovider
/sig cluster-lifecycle

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 22, 2017
@k8s-ci-robot
Copy link
Contributor

Hi @prydie. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 22, 2017
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Aug 22, 2017
InstanceExists(nodeName types.NodeName) (bool, error)
// InstanceExistsByProviderID returns true if the instance for the given provider id still is running.
// If false is returned with no error, the instance will be immediately deleted.
// Note: The `cloudprovider.InstanceNotFound` error should not be returned here to notify
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not? Please clarify the comment here, it's not obvious what you mean

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luxas do you think I should just omit that? I added it originally since people are use to using that error to denote the instance is gone, so I wanted to call out to not use that. However, calling it out in hindsight seems more confusing. Thoughts on just removing that note?

@@ -372,6 +382,18 @@ func excludeTaintFromList(taints []v1.Taint, toExclude v1.Taint) []v1.Taint {
return newTaints
}

func ensureNodeExistsByProviderIDOrName(instances cloudprovider.Instances, node *v1.Node) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment and/or unit test here please

@luxas
Copy link
Member

luxas commented Aug 22, 2017

/ok-to-test
/assign @thockin @wlan0

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 22, 2017
@jhorwit2 jhorwit2 force-pushed the for/upstream/master/ccm-instance-exists branch from a620a66 to 8fc4a7c Compare August 22, 2017 12:44
@jhorwit2
Copy link
Contributor

/retest

@jhorwit2
Copy link
Contributor

Oh whoops my changes aren't being picked up... weird.

@wlan0
Copy link
Member

wlan0 commented Aug 22, 2017

@prydie @jhorwit2

Thanks for the PR. I reviewed it. The problem with this PR is the ByNameOrID pattern has been followed, but without a reason. The ByName method does not add any value.

Aren't we all moving to addressing nodes by ProviderID? Woudln't it suffice to just add the ByProviderID method?

Additionally, this is breaking existing functionality. Have you tested this change?

Thirdly, We had the ByNameOrProviderID pattern to allow CCM to work without breaking existing functionality. If I merge this, CCM will stop working.

@jhorwit2
Copy link
Contributor

jhorwit2 commented Aug 22, 2017

@wlan0 This is the same pattern followed in the zones PR that you gave your stamp of approval on. Should we opt to remove ByName in both to keep things consistent in the CCM? It also broke existing functionality with the CCM by removing the Zones() call.

The only existing functionality that would be broken is for the CCM, correct? That's the only place the CloudNodeController is created.

@wlan0
Copy link
Member

wlan0 commented Aug 22, 2017

@jhorwit2 The Zones PR required both methods. The GetZone() method did not take any parameter because it was always called from kubelet/always from the node which was querying the zone. There was no existing method for getting zones for a node from a controller either ByName or byID, and since providerID was not always supported, it made sense to have both methods there. That is why it got my stamp of approval.

In this case, externalID is already a method that already exists for finding node existence by Name. Therefore it is unnecessary to add another one.

@wlan0
Copy link
Member

wlan0 commented Aug 22, 2017

The only existing functionality that would be broken is for the CCM, correct? That's the only place the CloudNodeController is created.

@jhorwit2 Yes. This PR would break the functionality of the CCM.

@andrewsykim
Copy link
Member

For what it's worth, this would definitely break DigitalOcean's CCM if I did a vendor update, but it would be acceptable since CCM is technically alpha still (right?).

Another thing to consider though is that we'll be making it more difficult for existing cloud providers to migrate since we're adding more friction to integrate CCM (having to implement InstanceExists* methods). On the plus side, adding this change now will mean we can deprecate ExternalID faster since there will be less users of it in the future.

@jhorwit2
Copy link
Contributor

Zones technically breaks existing functionality as well just in a different way. Any existing clusters managing by the CCM will have nodes with a random failure zone/region value set by the current CCM implementations i've seen. The CCM has no way to correct that now with the updated Zone logic.

I'm all for changing to only using provider id and moving to one method. I guess the point I'm trying to make is why support continue to ExternalID in the CCM if it's a deprecated method?

If you want I'll change the current check to something like

func ensureNodeExistsByProviderIDOrName(instances cloudprovider.Instances, node *v1.Node) (bool, error) {
	exists, err := instances.InstanceExistsByProviderID(node.Spec.ProviderID)
	if err != nil {
		providerIDErr := err
		_, err = instances.ExternalID(types.NodeName(node.Name))
		if err == cloudprovider.InstanceNotFound {
			return false, nil
		}
		if err != nil {
			return false, fmt.Errorf("InstanceExists: Error fetching by providerID: %v Error fetching by NodeName: %v", providerIDErr, err)
		}
		return true, nil
	}
	
	return exists, nil
}

so we keep using ExternalID although I don't really agree with keeping it around in the CCM. Any cloud providers that move to the CCM will already have to implement new methods (zones for example), so why not also implement this to have the desired state up front and drop support for the ExternalID call?

@wlan0
Copy link
Member

wlan0 commented Aug 22, 2017

Zones technically breaks existing functionality as well just in a different way. Any existing clusters managing by the CCM will have nodes with a random failure zone/region value set by the current CCM implementations i've seen. The CCM has no way to correct that now with the updated Zone logic.

Yes, this needs to be fixed. The zones API does not add this bug though. Actually, It gets one step closer to fixing this.

Also, its not a random zone/region that gets assigned. It is the zone in which the CCM is running.

I'm all for changing to only using provider id and moving to one method. I guess the point I'm trying to make is why support continue to ExternalID in the CCM if it's a deprecated method?

It is not deprecated. Where did you get this information?

@wlan0
Copy link
Member

wlan0 commented Aug 22, 2017

For what it's worth, this would definitely break DigitalOcean's CCM if I did a vendor update, but it would be acceptable since CCM is technically alpha still (right?).

Yes, and others. It is not acceptable to introduce breaking changes especially since we don't have all the cloudvendors ready to fill the voids that fix the breaks. It is always advisable to make non-breaking changes.

Another thing to consider though is that we'll be making it more difficult for existing cloud providers to migrate since we're adding more friction to integrate CCM (having to implement InstanceExists* methods). On the plus side, adding this change now will mean we can deprecate ExternalID faster since there will be less users of it in the future.

yes that is correct. Also, deprecation of ExternalID is not something that has been decided. Why are we all working under that assumption?

@prydie prydie changed the title Add InstanceExists* methods to cloud provider interface for CCM Add InstanceExistsByProviderID to cloud provider interface for CCM Aug 25, 2017
@caesarxuchao caesarxuchao removed their assignment Aug 25, 2017
@@ -1101,6 +1101,11 @@ func (c *Cloud) ExternalID(nodeName types.NodeName) (string, error) {
return orEmpty(instance.InstanceId), nil
}

// InstanceExistsByProviderID returns true if the instance with the given provider id still exists and is running.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think it hurts to add comments from the cloud interface here for visibility

If false is returned with no error, the instance will be immediately deleted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

instances, _ := fc.Instances()
exists, err := ensureNodeExistsByProviderIDOrName(instances, tc.node)
if err != nil {
t.Fatal(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, t.Fatal is for when you fail to setup a test properly, t.Error for when a test check fails

t.Fatalf("expected exist by provider id to be `%t` but got `%t`", tc.existsByProviderID, exists)
}

if tc.existsByNodeName && tc.existsByNodeName != exists {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tc.existsByNodeName != exists is fine?

t.Fatalf("expected cloud provider methods `%v` to be called but `%v` was called ", tc.expectedCalls, fc.Calls)
}

if tc.existsByProviderID && tc.existsByProviderID != exists {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tc.existsByProviderID != exists should be fine here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't work since you have two things to check to determine existence.

Test Case: if exists should be true because it exists by node name and not provider it, this test would fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see a previous way people were testing these "by provider id or node name" methods, so if you have a better idea i'm all ears :)

t.Fatalf("expected exist by node name to be `%t` but got `%t`", tc.existsByNodeName, exists)
}

if !tc.existsByNodeName && !tc.existsByProviderID && exists {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check seems unesscary, if tc.existsByNodeName and tc.existsByProviderID are properly set then the previous checks should have failed the test?

@luxas
Copy link
Member

luxas commented Aug 25, 2017

ping me when all other reviewers' comments are addressed please

@jhorwit2
Copy link
Contributor

@luxas i've addressed all the comments.

@thockin thockin assigned thockin and unassigned thockin Aug 25, 2017
@thockin
Copy link
Member

thockin commented Aug 25, 2017

Assign me AFTER it is lgtm'ed :)

Copy link
Member

@luxas luxas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small-ish comments and clarifications
On a high level, LGTM though

@@ -1101,6 +1101,12 @@ func (c *Cloud) ExternalID(nodeName types.NodeName) (string, error) {
return orEmpty(instance.InstanceId), nil
}

// InstanceExistsByProviderID returns true if the instance with the given provider id still exists and is running.
// If false is returned with no error, the instance will be immediately deleted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... by the cloud controller manager?

}(node.Name)
}
glog.Errorf("Error getting node data from cloud: %v", err)
exists, err := ensureNodeExistsByProviderIDOrExternalID(instances, node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we deprecated the ExternalID name everywhere, why not ensureNodeExistsByProviderIDOrNodeName?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) that's what I had before, but changed it based on @wlan0's feedback.

exists, err := instances.InstanceExistsByProviderID(node.Spec.ProviderID)
if err != nil {
providerIDErr := err
_, err = instances.ExternalID(types.NodeName(node.Name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not instances.NodeName like we talked about?
Isn't that implemented yet? (Did it stall out in a PR somewhere?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall us discussing that. Mind filling me in??

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we landed on keeping the ExistsByProviderID & ExternalID checks. Then in the future we can add a ExistsByNodeName if we want to, but avoid adding that method now since it's a duplicate of ExternalID.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah, never mind 😄
That was the case indeed, recalled incorrectly

Copy link
Member

@luxas luxas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
/assign @thockin

@luxas luxas added this to the v1.8 milestone Aug 25, 2017
@luxas
Copy link
Member

luxas commented Aug 25, 2017

/retest

@jhorwit2
Copy link
Contributor

bazel flake: #51379

@thockin
Copy link
Member

thockin commented Aug 25, 2017

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: prydie, thockin

Associated issue: 50985

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2017
@jhorwit2
Copy link
Contributor

/test pull-kubernetes-e2e-gce-bazel

@jhorwit2
Copy link
Contributor

jhorwit2 commented Aug 26, 2017 via email

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 51174, 51363, 51087, 51382, 51388)

@k8s-github-robot k8s-github-robot merged commit 27fbb68 into kubernetes:master Aug 26, 2017
dims pushed a commit to dims/kubernetes that referenced this pull request Feb 8, 2018
…cm-instance-exists

Automatic merge from submit-queue (batch tested with PRs 51174, 51363, 51087, 51382, 51388)

Add InstanceExistsByProviderID to cloud provider interface for CCM

**What this PR does / why we need it**:

Currently, [`MonitorNode()`](https://github.com/kubernetes/kubernetes/blob/02b520f0a40be2056d91fc0661c2b4fdb2694c30/pkg/controller/cloud/nodecontroller.go#L240) in the node controller checks with the CCM if a node still exists by calling `ExternalID(nodeName)`. `ExternalID` is supposed to return the provider id of a node which is not supported on every cloud. This means that any clouds who cannot infer the provider id by the node name from a remote location will never remove nodes that no longer exist. 


**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes#50985

**Special notes for your reviewer**:

We'll want to create a subsequent issue to track the implementation of these two new methods in the cloud providers.

**Release note**:

```release-note
Adds `InstanceExists` and `InstanceExistsByProviderID` to cloud provider interface for the cloud controller manager
```

/cc @wlan0 @thockin @andrewsykim @luxas @jhorwit2

/area cloudprovider
/sig cluster-lifecycle
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet