[vSphere Provider] Fix vc conn timeout issue #40516
Conversation
Signed-off-by: Chen Jing <jingch@vmware.com>
Force-pushed from 181065c to c808281
Couple minor questions, but otherwise looks good to me
if not obj:
    raise RuntimeError(
        f"Unexpected: cannot find vSphere object {vimtype} with moid: {moid}"
    )
- What's the reason for the changes to this function?
- Previously `get_pyvmomi_obj_by_*` could not return `None` (instead it would raise an exception), but now it can return `None`. Do the callers need to handle the case where it's `None`?
- I think the reason is to simplify the function's logic. Actually, when I did the cherry-pick for the bug fix, I found that this file had been changed in another PR by a teammate, so I also cherry-picked that commit.
- Yes, I agree with you that we should not change the behaviour like this without changing the callers of this function. Let me rework the function so that it still throws an exception when things are not correct.
if not obj:
    raise RuntimeError(
        f"Unexpected: cannot find vSphere object {vimtype} with name: {name}"
    )
I noticed the same code changes needed to be made in two places; I still think it would be a bit better to unify the functions, maybe like `def get_pyvmomi_obj(self, vimtype, name=None, moid=None)`, but it's not strictly related to the bug fixed in this PR and doesn't need to block the PR.
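For illustration, a minimal sketch of what that unified helper might look like, using the standard pyVmomi container-view pattern; the function body and the `content` parameter (a `vim.ServiceInstanceContent` obtained from the provider's existing connection) are assumptions for this sketch, not the provider's actual code:

```python
from pyVmomi import vim


def get_pyvmomi_obj(content, vimtype, name=None, moid=None):
    """Hypothetical unification of get_pyvmomi_obj_by_name / _by_moid."""
    if name is None and moid is None:
        raise ValueError("Either name or moid must be provided")
    # Walk the inventory with a container view (standard pyVmomi pattern).
    container = content.viewManager.CreateContainerView(
        content.rootFolder, vimtype, True
    )
    try:
        for candidate in container.view:
            if name is not None and candidate.name == name:
                return candidate
            if moid is not None and candidate._moId == moid:
                return candidate
    finally:
        container.Destroy()
    raise RuntimeError(
        f"Unexpected: cannot find vSphere object {vimtype} "
        f"with name: {name} or moid: {moid}"
    )
```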
Signed-off-by: Chen Jing <jingch@vmware.com>
Book linkcheck failure is unrelated; this PR does not touch the docs.
…roject#40516)

Fixed the issue using SessionOrientedStub, a session-oriented stub adapter that will re-login to the destination if a session-oriented exception is thrown.

Signed-off-by: Chen Jing <jingch@vmware.com>
…sue and support GPU nodes (#40667)

* [Cluster launcher] [vSphere] Fix the bug that multiple worker types doesn't work (#40487)
  Currently our code assumes that there is only one worker node type. This change fixes the bug so that multiple worker node types are supported.
  Signed-off-by: Chen Jing <jingch@vmware.com>
  Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
* [cluster launcher] [vSphere Provider] Fix vc conn timeout issue (#40516)
  Fixed the issue using SessionOrientedStub, a session-oriented stub adapter that will re-login to the destination if a session-oriented exception is thrown.
  Signed-off-by: Chen Jing <jingch@vmware.com>
* [cluster launcher] [vSphere Provider] Support GPU Ray nodes on vSphere (#40616)
  This is for supporting passthrough of the GPU on a vSphere ESXi host into the Ray nodes.
  Signed-off-by: Chen Jing <jingch@vmware.com>
* [cluster launcher] [vSphere] Do not fetch runtime-info of vm from cached_nodes (#40655)
  Power-on/off status is runtime info of a VM and should not be fetched from cached_nodes, which may hold dirty data; it should be queried via the pyvmomi SDK every time.
  Signed-off-by: Chen Hui <huchen@vmware.com>

Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Chen Hui <huchen@vmware.com>
Co-authored-by: Chen Jing <jingch@vmware.com>
Co-authored-by: huchen2021 <85480625+huchen2021@users.noreply.github.com>
Why are these changes needed?
Fixed the issue using SessionOrientedStub, a session-oriented stub adapter that re-logs in to the destination if a session-oriented exception is thrown.
The issue here is that the SOAP request session times out after 30 minutes on the server side.
Increasing the connectionPoolTimeout value on the client side, or setting it to -1, doesn't solve the issue; we tested it and the issue still persists. This is because the connectionPoolTimeout parameter only closes idle connections from the client side: if a client connection has been idle longer than the timeout value, a new connection is created automatically. Passing connectionPoolTimeout = -1 has no effect on the session timeout; -1 only means the client will never close an idle connection, and it doesn't mean the connection will be valid forever. A SOAP request session will time out after 30 minutes on the server side regardless of whether the client closes the connection.
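For reference, the client-side tuning that was tried and did not help looks roughly like this (a sketch, not the provider's actual configuration; `host`, `user`, and `pwd` are placeholder values):

```python
import ssl

from pyVim.connect import SmartConnect

# connectionPoolTimeout only governs when the *client* closes idle
# connections; vCenter still expires the SOAP session on the server side
# after ~30 minutes, so this does not fix the timeout.
si = SmartConnect(
    host=host,  # placeholder vCenter address
    user=user,  # placeholder username
    pwd=pwd,    # placeholder password
    sslContext=ssl._create_unverified_context(),
    connectionPoolTimeout=-1,  # -1: client never closes idle connections
)
```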
The fix uses SessionOrientedStub, a session-oriented stub adapter that re-logs in to the destination if a session-oriented exception is thrown. The stub starts off in the "unauthenticated" state, so it will call the loginMethod on the first invocation of a method. If a communication error is encountered, the stub will wait for retryDelay seconds and then try to call the method again. If the server throws an exception that is in the SESSION_EXCEPTIONS tuple, it will be caught and the stub will transition back into the "unauthenticated" state so that another login is performed.
Modified the code to use SmartStubAdapter, using the following as references:
- https://github.com/vmware/pyvmomi/blob/912e365564c927dc9201eb6f46262924dc1dd275/pyVmomi/SoapAdapter.py#L1530
- https://developer.vmware.com/samples/6889/esxi_certtool?h=connectionPoolTimeout#code
- vmware/pyvmomi#347
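For illustration, the session-oriented connection pattern from those references looks roughly like this (a sketch, not the exact provider code; `host`, `user`, and `pwd` are placeholders):

```python
import ssl

from pyVim.connect import SmartStubAdapter, VimSessionOrientedStub
from pyVmomi import vim

# SmartStubAdapter resolves the endpoint; VimSessionOrientedStub wraps it and
# re-logs in whenever a session-oriented fault (e.g. vim.fault.NotAuthenticated)
# is raised, so a session expired by the server is re-established transparently.
smart_stub = SmartStubAdapter(
    host=host,  # placeholder vCenter address
    port=443,
    sslContext=ssl._create_unverified_context(),
)
login_method = VimSessionOrientedStub.makeUserLoginMethod(user, pwd)  # placeholders
session_stub = VimSessionOrientedStub(smart_stub, login_method)
si = vim.ServiceInstance("ServiceInstance", session_stub)
content = si.RetrieveContent()  # later calls re-login automatically on timeout
```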
Test
After running `ray up`, we left the environment idle for more than 30 minutes, 1 hour, and 18 hours respectively, then deleted one worker node VM from the vSphere client. We verified that a new worker node was created by the autoscaler calling the vSphere API, which proves the session was still alive.
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.