multinode cluster: fix waits and joins #10758
Conversation
```go
m := config.MachineName(cc, *n)
api, err := machine.NewAPIClient()
if err != nil {
	return n, err
}
```
after this PR we will no longer attempt to delete a Host whose config we cannot load.
this might cause some unintended side effects.
could you comment on why we need to return the error instead of trying to clean up the Host anyway?
if there's no good reason, consider klog.ErrorS and try the delete anyway
maybe i'm wrong, but in this place it will return if it cannot get the api client (needed for the delete), not if it cannot load the config?
did you perhaps refer to the previous call to the remove func? i agree it should not return on err there, but continue and try to delete the host anyway; just not sure atm if we would also need to do anything additional to clean up the leftovers
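to make the discussion concrete, here is a minimal sketch of the log-and-continue shape being suggested (simplified, hypothetical names; `remove` and `deleteNode` are stand-ins, not the exact minikube code):

```go
package node

import (
	"k8s.io/klog/v2"

	"k8s.io/minikube/pkg/minikube/config"
	"k8s.io/minikube/pkg/minikube/machine"
)

// deleteNode sketches the flow discussed above: a failed remove is logged and
// cleanup continues, while a missing API client still has to abort, because
// the host delete below depends on it.
func deleteNode(cc config.ClusterConfig, n *config.Node) (*config.Node, error) {
	if err := remove(cc, n); err != nil {
		// log-and-continue: don't leak the underlying host just because the
		// node couldn't be removed from the cluster config
		klog.ErrorS(err, "failed to remove node, attempting host delete anyway", "node", n.Name)
	}
	api, err := machine.NewAPIClient()
	if err != nil {
		return n, err // the delete below cannot run without the client
	}
	if err := machine.DeleteHost(api, config.MachineName(cc, *n)); err != nil {
		return n, err
	}
	return n, nil
}
```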
```diff
 err = machine.DeleteHost(api, m)
 if err != nil {
 	return n, err
 }

-cc.Nodes = append(cc.Nodes[:index], cc.Nodes[index+1:]...)
-return n, config.SaveProfile(viper.GetString(config.ProfileName), &cc)
+return n, nil
```
this code deletes the SaveProfile. are you sure this needed to be done? what was the reason this had to be removed as part of this PR?
it is still done, just at the end of the remove func (where the node is effectively removed from the cluster)
the remove func is called at the beginning of this delete func
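for context, a hedged sketch of where the save lives now (shape only, not the exact PR code):

```go
import (
	"github.com/spf13/viper"

	"k8s.io/minikube/pkg/minikube/config"
)

// remove sketches the consolidated behavior described above: drop the node
// from the cluster config and persist the profile in one place, so the delete
// path no longer needs its own SaveProfile call.
func remove(cc *config.ClusterConfig, index int) error {
	cc.Nodes = append(cc.Nodes[:index], cc.Nodes[index+1:]...)
	return config.SaveProfile(viper.GetString(config.ProfileName), cc)
}
```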
what was the reason it was removed in this PR? was it related, or could it be done in a separate PR?
/ok-to-test
kvm2 Driver Times for Minikube (PR 10758): 61.8s, 60.4s, 61.3s
docker Driver Times for Minikube (PR 10758): 26.2s, 24.8s, 25.8s
@prezha the none driver test says this, which seems odd, since with the none driver there is no multi-node
might also wanna rebase from upstream
force-pushed from 622b4e3 to dc476fe
good spot, thank you @medyagh ! here is what i've found:

extract:

in contrast, in single-node clusters with

extract:

extract:

so, i've replaced the code segment in this pr:

with:

and that should fix it

// side thought: i'm not sure why the none driver sets the name of its node (in contrast to other drivers)?
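on the side thought, a small illustration of the naming contrast (hypothetical helper, not minikube's actual API): kubelet defaults the node name to the hostname, so the none driver, which runs directly on the host, registers the hostname, while machine-backed drivers get generated names like `minikube-m02`:

```go
package node

import "os"

// expectedNodeName is purely illustrative: machine-backed drivers derive node
// names from the profile (plus an m02-style suffix for extra nodes), while
// the none driver runs on the host, so kubelet registers the hostname.
func expectedNodeName(driver, profile, suffix string) (string, error) {
	if driver == "none" {
		return os.Hostname()
	}
	if suffix == "" {
		return profile, nil // primary control-plane node
	}
	return profile + "-" + suffix, nil
}
```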
kvm2 Driver Times for Minikube (PR 10758): 63.3s, 64.1s, 62.6s
docker Driver Times for Minikube (PR 10758): 31.2s, 26.0s, 25.9s
kvm2 Driver Times for Minikube (PR 10758): 63.7s, 65.0s, 65.4s
docker Driver Times for Minikube (PR 10758): 27.2s, 26.9s, 26.9s
thank you @prezha this PR made the Docker Linux Tests have 0 Failures !!! nice work
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: medyagh, prezha

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
fixes #10640

multinode tests usually fail for at least these three reasons:

kubeadm: prevent bootstrap of nodes with known names kubernetes#81056
("kubeadm should take action on preventing the bootstrap of such nodes on kubeadm join time.")
("If a Node name in the cluster is already taken and this Node is Ready, prevent TLS bootstrap on "kubeadm join" and exit early.")

so, trying to rejoin an 'old' node times out after 3 minutes with 'error execution phase kubelet-start: a Node with name "name" and status "Ready" already exists in the cluster. You must delete the existing Node or change the name of this new joining Node'; deleting that node and creating a new one then lets it join the cluster without issues, but it all takes additional time
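given that error, one natural fix shape is to delete the stale Node object before re-joining under the same name — a hedged sketch using client-go (assuming a clientset is available; not necessarily how this PR implements it):

```go
package node

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteStaleNode removes a leftover Node object so that "kubeadm join" for a
// re-created node with the same name doesn't fail TLS bootstrap with
// "a Node with name ... already exists".
func deleteStaleNode(ctx context.Context, cs kubernetes.Interface, name string) error {
	return cs.CoreV1().Nodes().Delete(ctx, name, metav1.DeleteOptions{})
}
```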
now, this pr addresses these issues by:

i've successfully tested the multinode and functional suites with the docker and kvm drivers, and, as we avoid unnecessary waits and timeouts, they should now also be significantly faster, but let's wait for the official results from @medyagh ;)