[Serve] Revisiting ProxyState to fix draining sequence #41722

Conversation
Force-pushed from e0080b8 to 73909e2
Force-pushed from 73909e2 to d924379
It's very difficult to see where the actual behavior fix is here -- let's split out the targeted bug fix and the general refactoring into two separate PRs please. This is currently a very large change that adds unnecessary risk (especially if cherry-picked).
(Let me know if I'm mistaken and all of the refactoring is necessary to fix the issue)
# Yield the event-loop
await asyncio.sleep(0.001)

assert aio_fut.done()
assert aio_fut.result() == "test"
You can use ray._private.test_utils.async_wait_for_condition as a common pattern for assertions that require the event loop to be yielded.
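The suggestion above replaces a fixed `asyncio.sleep` with a polling helper. A minimal self-contained sketch of that pattern (this is a hypothetical stand-in for `ray._private.test_utils.async_wait_for_condition`, not its actual implementation):

```python
import asyncio


async def async_wait_for_condition(condition, timeout_s=5.0, retry_interval_s=0.01):
    # Repeatedly yield the event loop until `condition()` holds or we time out.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while loop.time() < deadline:
        if condition():
            return
        await asyncio.sleep(retry_interval_s)
    raise TimeoutError("Condition not met before timeout.")


async def main():
    loop = asyncio.get_running_loop()
    aio_fut = loop.create_future()
    # Schedule the result to be set on the next event-loop iteration.
    loop.call_soon(aio_fut.set_result, "test")
    # Instead of a fixed `await asyncio.sleep(0.001)`, poll until done.
    await async_wait_for_condition(aio_fut.done)
    assert aio_fut.result() == "test"
    return aio_fut.result()


print(asyncio.run(main()))
```

Polling with a small retry interval makes the test robust to scheduling jitter, whereas a hard-coded sleep can flake when the loop is busy.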
Thanks for revisiting the code and trying to make it better! My biggest point is that it would be nice to keep ProxyWrapperCallStatus around and revert the update() method name. The logic seems consistent with before, and hopefully any regression will be caught by the CI (currently passing 😄)
self._ready_check_future = wrap_as_future(
    self._actor_handle.ready.remote(), timeout_s=timeout_s
)
return None
Nit (non-blocker): we probably don't need to explicitly return None here. We can go directly into checking whether the future is done below. If the ready call is fast enough, this saves a loop iteration and a bit of start time.
Good call! Will update
Just to call out that it won't really pay off much, since this code occupies the event loop while executing; it only saves us a loop iteration when we create a future that completes immediately (i.e. failed). But I like the general direction of the thinking here, and I don't see a downside to doing it.
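The fall-through the reviewers are discussing can be illustrated with a toy version of the readiness check. Note that `wrap_as_future` and `ReadinessChecker` below are simplified stand-ins (names and behavior assumed for illustration), not Serve's actual implementation:

```python
import asyncio


def wrap_as_future(value=None, exc=None):
    # Hypothetical stand-in for Serve's wrap_as_future: returns an asyncio
    # Future that may already be completed (e.g. the call failed right away).
    fut = asyncio.get_event_loop().create_future()
    if exc is not None:
        fut.set_exception(exc)
    else:
        fut.set_result(value)
    return fut


class ReadinessChecker:
    def __init__(self):
        self._ready_check_future = None

    def is_ready(self):
        if self._ready_check_future is None:
            self._ready_check_future = wrap_as_future(exc=RuntimeError("boom"))
            # No early `return None` here: fall through so an
            # immediately-completed (here, failed) future is detected
            # on this same control-loop iteration.
        if self._ready_check_future.done():
            try:
                self._ready_check_future.result()
                return True
            except Exception:
                return False
        return None  # Still pending; check again on the next iteration.


async def main():
    return ReadinessChecker().is_ready()


print(asyncio.run(main()))
```

With the early return, the failure would only be observed one reconciliation iteration later; falling through reports it immediately.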
return None

def is_drained(
    self, timeout_s: float = PROXY_HEALTH_CHECK_TIMEOUT_S
This should probably be PROXY_DRAIN_CHECK_PERIOD_S
Or should we add a dedicated timeout for the drain check?
We can generalize the HC timeout (I don't think we need a separate one for everything, provided they serve practically identical purposes).
I think if we decide to just use PROXY_HEALTH_CHECK_TIMEOUT_S for all the timeouts, maybe we should add a comment to that constant, or preferably rename it. Also, is there a reason we add a default for is_drained but not for is_ready and is_healthy?
Yeah, good call. Will make them coherent.
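The coherence being agreed on above amounts to giving all three probes the same, explicitly documented default. A sketch (class and constant names taken from the diff; bodies are placeholders, not Serve's real logic):

```python
# Shared timeout, reused for the ready/health/drain probes; the comment
# documents the decision discussed in the review.
PROXY_HEALTH_CHECK_TIMEOUT_S = 10.0


class ActorProxyWrapper:
    """Sketch: all probes take the same default timeout."""

    def is_ready(self, timeout_s: float = PROXY_HEALTH_CHECK_TIMEOUT_S) -> bool:
        return True  # Placeholder body for the sketch.

    def is_healthy(self, timeout_s: float = PROXY_HEALTH_CHECK_TIMEOUT_S) -> bool:
        return True  # Placeholder body for the sketch.

    def is_drained(self, timeout_s: float = PROXY_HEALTH_CHECK_TIMEOUT_S) -> bool:
        return True  # Placeholder body for the sketch.
```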
(#41744) (#41755) This is a minified version of https://github.com/ray-project/ray/pull/41722/files, specifically prepared to be cherry-picked into 2.9. Addresses #41726. Addresses the following gaps:
- Patches the ActorProxyWrapper.is_drained method to handle the RPC response properly
- Cherry-picks a test from "[Serve] Revisiting ProxyState to fix draining sequence" #41722 to validate that the draining sequence is correct

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Force-pushed from 8faf7f7 to 74dcaf4
Is this ready for review @alexeykudinkin ?

Yes
I still feel we should keep the ProxyWrapperCallStatus enum around, but it's not a hard blocker. Thanks for revising this logic! :)
Looks great! Just a few stylistic comments and suggestions around wrap_as_future.
if self._ready_check_future is None:
    self._ready_check_future = wrap_as_future(
        self._actor_handle.ready.remote(), timeout_s=timeout_s
    )
Maybe an unnecessary optimization, but should we just early-return None here?
Ha! That's actually how it was before, but @GeneDer made a good point about just letting it fall through, so that if the future fails right away we don't wait for another iteration:
https://github.com/ray-project/ray/pull/41722/files#r1421100067
I think it's this thread: #41722 (comment). But I don't have a strong preference for either 😛
    TimeoutError(f"Future cancelled after timeout {timeout_s}s"),
)
# Cancel timeout handler upon completion of the future
aio_fut.add_done_callback(lambda _: timeout_handler.cancel())
I am pretty sure that because the timeout_handler does not catch TimeoutError, it can log a nasty exception traceback ("unhandled exception in coroutine ..."). Can you add an "except TimeoutError: pass" block to the timeout handler?
timeout_handler is just a handler for _try_set_exception (previously _try_cancel_future); why would it need to handle TimeoutError?
Oh I'm sorry, typo, I meant asyncio.CancelledError
Got it. This is actually a TimerHandle, not the Future, so cancelling it won't throw CancelledError:
def cancel(self):
if not self._cancelled:
self._cancelled = True
if self._loop.get_debug():
# Keep a representation in debug mode to keep callback and
# parameters. For example, to log the warning
# "Executing <Handle...> took 2.5 second"
self._repr = repr(self)
self._callback = None
self._args = None
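Putting the thread together, the timeout pattern under discussion can be sketched end to end. This is a simplified, self-contained version (function name `wrap_with_timeout` and the helper names are assumed for illustration, not Serve's exact API); it shows that `TimerHandle.cancel()` simply unschedules the callback, so no `CancelledError` is ever raised:

```python
import asyncio


def wrap_with_timeout(loop, timeout_s):
    # Arm a TimerHandle that fails the future after `timeout_s`, and cancel
    # the timer once the future completes, mirroring the reviewed code.
    aio_fut = loop.create_future()

    def _try_set_exception():
        if not aio_fut.done():
            aio_fut.set_exception(
                TimeoutError(f"Future cancelled after timeout {timeout_s}s")
            )

    timeout_handler = loop.call_later(timeout_s, _try_set_exception)
    # Cancel timeout handler upon completion of the future. TimerHandle.cancel()
    # just clears the scheduled callback; it does not raise CancelledError.
    aio_fut.add_done_callback(lambda _: timeout_handler.cancel())
    return aio_fut


async def main():
    loop = asyncio.get_running_loop()

    # Case 1: completes before the timeout; the timer is cancelled cleanly.
    fut = wrap_with_timeout(loop, timeout_s=1.0)
    loop.call_soon(fut.set_result, "ok")
    assert await fut == "ok"

    # Case 2: never completes; the timer fires and sets TimeoutError.
    fut = wrap_with_timeout(loop, timeout_s=0.01)
    try:
        await fut
    except TimeoutError as e:
        return str(e)


print(asyncio.run(main()))
```

Because the failure path uses `set_exception` on the future rather than cancelling a task, awaiters see a TimeoutError they can catch, and there is no unhandled-exception traceback from the timer callback itself.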
@edoakes feedback addressed, PTAL
@alexeykudinkin DCO and test failure
@alexeykudinkin just FYI, that test failure has to do with the latest Starlette version. If you just rebase onto the latest master, it should automatically skip that test and allow the merge. We have a separate PR to fix the Starlette issue: #42378
Force-pushed from b700fba to b947610
@alexeykudinkin I think those are already approved by us, right? Is there anything new you want us to look at in particular? DCO is still failing because the email you used in that commit is alexey.kudinkin@gmail.com instead of ak@anyscale.com. I think if you amend that commit with the Anyscale email and force-push again, it should be resolved :)
Commits (all signed off by Alexey Kudinkin <ak@anyscale.com>):
- Rebased `set_status` calls to invoke `try_update_status` instead (for proper state-management)
- …operly drained; Missing changes
- Clean up `ProxyWrapperCallStatus`
- Added logging
- Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
- (plus a number of commits without messages)
Force-pushed from b947610 to 8a1e954
Sorry, I missed that @edoakes had already approved it as well. Thanks for reviewing it!
Why are these changes needed?
These changes revisit ProxyState to fix the ProxyState reconciliation sequence.

Changes
- Moved ProxyActor from ProxyState to ActorProxyWrapper (substantially simplifying it)

Related issue number
Closes #41726 (added test)

Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.