
[Serve] Revisiting ProxyState to fix draining sequence #41722

Merged: 29 commits, Jan 22, 2024
Conversation

@alexeykudinkin (Contributor) commented Dec 8, 2023

Why are these changes needed?

These changes revisit ProxyState to fix the proxy draining sequence.

Changes

  • Offloaded async interactions w/ the ProxyActor from ProxyState to ActorProxyWrapper (substantially simplifying it)
  • Streamlined overall reconciliation sequence to make it easier to comprehend
  • Cleaned up tests (avoid sleeping, added validations), made sure all flows are adequately covered

Related issue number

Closes #41726 (added test)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@alexeykudinkin alexeykudinkin changed the title [WIP][Serve] Revisiting ProxyState to fix draining sequence, simplify state handling [WIP][Serve] Revisiting ProxyState to fix draining sequence Dec 8, 2023
@alexeykudinkin alexeykudinkin linked an issue Dec 8, 2023 that may be closed by this pull request
@edoakes (Collaborator) left a comment:

It's very difficult to see where the actual behavior fix is here -- let's split out the targeted bug fix and the general refactoring into two separate PRs please. This is currently a very large change that adds unnecessary risk (especially if cherry-picked).

(Let me know if I'm mistaken and all of the refactoring is necessary to fix the issue)

Comment on lines +57 to +61
# Yield the event-loop
await asyncio.sleep(0.001)

assert aio_fut.done()
assert aio_fut.result() == "test"
Collaborator:

You can use ray._private.test_utils.async_wait_for_condition as a common pattern for adding these assertions that require the loop to be yielded

@alexeykudinkin alexeykudinkin changed the title [WIP][Serve] Revisiting ProxyState to fix draining sequence [Serve] Revisiting ProxyState to fix draining sequence Dec 8, 2023
@GeneDer (Contributor) left a comment:

Thanks for revisiting the code and trying to make it better! I think the biggest point is that it would be nice to keep ProxyWrapperCallStatus around and revert the update() method name. The logic seems consistent with before, but hopefully any regression will be caught by the CI (currently passing 😄)

self._ready_check_future = wrap_as_future(
self._actor_handle.ready.remote(), timeout_s=timeout_s
)
return None
Contributor:

Nit (non-blocker): we probably don't need to explicitly return None here. We can fall directly into checking whether the future is done below. If the ready call is fast enough, this will save a loop iteration and slightly improve start time.

Contributor Author:

Good call! Will update

Contributor Author:

Just to call out that it won't really pay off, since this code occupies the event loop while executing; it only saves us a loop iteration when we create a future that completes immediately (i.e., failed). But I like the general direction of the thinking here, and I don't see a downside to doing it.
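To make the fall-through point concrete, here is a hypothetical sketch (not the actual ActorProxyWrapper code; the class and helper names are invented for illustration) of scheduling the ready-check future and then immediately inspecting it in the same call, rather than returning None right after scheduling:

```python
# Sketch of the "schedule, then fall through" pattern discussed above:
# a future that fails (or completes) immediately is handled in the same
# call instead of waiting for the next reconciliation loop iteration.
import asyncio
from typing import Optional


class ProxyWrapperSketch:
    def __init__(self) -> None:
        self._ready_check_future: Optional[asyncio.Future] = None

    def is_ready(self) -> Optional[bool]:
        if self._ready_check_future is None:
            # Schedule the check; do NOT return here -- fall through so
            # an already-completed future is handled right away.
            self._ready_check_future = self._start_ready_check()
        if not self._ready_check_future.done():
            return None  # Still pending; re-check on the next loop.
        try:
            self._ready_check_future.result()
            return True
        except Exception:
            return False
        finally:
            self._ready_check_future = None

    def _start_ready_check(self) -> asyncio.Future:
        # Stand-in for wrapping the remote ready() call as a future.
        fut = asyncio.get_running_loop().create_future()
        fut.set_result("ok")
        return fut
```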

return None

def is_drained(
self, timeout_s: float = PROXY_HEALTH_CHECK_TIMEOUT_S
Contributor:

This should probably be PROXY_DRAIN_CHECK_PERIOD_S XP

Contributor:

Or make a timeout for drain check?

@alexeykudinkin (Contributor Author) commented Dec 11, 2023:

We can generalize the HC timeout (I don't think we need a separate one for everything, provided they serve practically identical purposes)

Contributor:

I think if we've decided to just use PROXY_HEALTH_CHECK_TIMEOUT_S for all the timeouts, we should at least add a comment to that constant, or preferably rename it.

Also is there a reason we should add a default for is_drained but not for is_ready and is_healthy?

Contributor Author:

Yeah, good call. Will make them coherent.

architkulkarni pushed a commit that referenced this pull request Dec 11, 2023
#41744) (#41755)

This is a minified version of https://github.com/ray-project/ray/pull/41722/files, specifically prepared to be cherry-picked into 2.9

Addresses #41726

Addresses the following gaps:

  • Patches the ActorProxyWrapper.is_drained method to handle the RPC response properly
  • Cherry-picks a test from [Serve] Revisiting ProxyState to fix draining sequence #41722 to validate that the draining sequence is correct
---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin (Contributor Author):

@edoakes @GeneDer updated/addressed comments, PTAL


@edoakes (Collaborator) commented Jan 11, 2024:

Is this ready for review @alexeykudinkin ?

@alexeykudinkin (Contributor Author):
Is this ready for review @alexeykudinkin ?

Yes

@GeneDer (Contributor) left a comment:

I still feel we should keep the ProxyWrapperCallStatus enum around, but it's not a hard blocker. Thanks for revising this logic! :)

@edoakes (Collaborator) left a comment:

Looks great! Just a few stylistic comments and suggestions around wrap_as_future.

Comment on lines +192 to +195
if self._ready_check_future is None:
self._ready_check_future = wrap_as_future(
self._actor_handle.ready.remote(), timeout_s=timeout_s
)
Collaborator:

maybe an unnecessary optimization, but should we just early-return None here?

Contributor Author:

Ha! That's actually what it was before, but @GeneDer made a good point about just letting it fall through, so that if the future has failed right away we don't wait for another iteration:

https://github.com/ray-project/ray/pull/41722/files#r1421100067

Contributor:

I think it's this thread #41722 (comment) But I don't have strong preference for either 😛

TimeoutError(f"Future cancelled after timeout {timeout_s}s"),
)
# Cancel timeout handler upon completion of the future
aio_fut.add_done_callback(lambda _: timeout_handler.cancel())
Collaborator:

I am pretty sure that because the timeout_handler does not catch TimeoutError, it can log a nasty exception traceback ("unhandled exception in coroutine ..."). Can you add an except TimeoutError: pass block to the timeout handler?

Contributor Author:

timeout_handler is just a handle for _try_set_exception (_try_cancel_future previously); why would it need to handle TimeoutError?

Collaborator:

Oh I'm sorry, typo, I meant asyncio.CancelledError

Contributor Author:

Got it. This is actually a TimerHandle, not a Future, so cancelling it won't throw CancelledError:

def cancel(self):
    if not self._cancelled:
        self._cancelled = True
        if self._loop.get_debug():
            # Keep a representation in debug mode to keep callback and
            # parameters. For example, to log the warning
            # "Executing <Handle...> took 2.5 second"
            self._repr = repr(self)
        self._callback = None
        self._args = None
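Summing up the thread, here is a hedged, self-contained sketch of the wrap_as_future pattern under discussion. The names _try_set_exception and timeout_handler follow the snippet quoted above; everything else is an assumption for illustration, not the PR's actual implementation:

```python
# Sketch: wrap a coroutine in an asyncio.Future that fails with
# TimeoutError after timeout_s. The timeout is a loop.call_later
# TimerHandle; cancelling it on completion just prevents the callback
# from running -- it does not raise CancelledError anywhere.
import asyncio


def wrap_as_future(coro, timeout_s: float) -> asyncio.Future:
    """Must be called with a running event loop."""
    loop = asyncio.get_running_loop()
    aio_fut = loop.create_future()
    task = loop.create_task(coro)

    def _try_set_exception(exc: BaseException) -> None:
        if not aio_fut.done():
            aio_fut.set_exception(exc)

    def _on_task_done(t: asyncio.Task) -> None:
        if t.cancelled():
            return
        exc = t.exception()  # Retrieve to avoid "never retrieved" warnings.
        if aio_fut.done():
            return
        if exc is not None:
            aio_fut.set_exception(exc)
        else:
            aio_fut.set_result(t.result())

    def _on_timeout() -> None:
        _try_set_exception(
            TimeoutError(f"Future cancelled after timeout {timeout_s}s")
        )
        task.cancel()  # Stop the underlying work as well.

    task.add_done_callback(_on_task_done)
    timeout_handler = loop.call_later(timeout_s, _on_timeout)
    # Cancel timeout handler upon completion of the future.
    aio_fut.add_done_callback(lambda _: timeout_handler.cancel())
    return aio_fut
```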

@alexeykudinkin (Contributor Author):

@edoakes feedback addressed, PTAL

@edoakes (Collaborator) commented Jan 19, 2024:

@alexeykudinkin DCO and test failure

@GeneDer (Contributor) commented Jan 19, 2024:

@alexeykudinkin just FYI, that test failure has to do with the latest Starlette version. If you just rebase onto the latest master, it should automatically skip that test and allow the merge. We have a separate PR to fix the Starlette issue: #42378

@alexeykudinkin (Contributor Author):

@edoakes @GeneDer addressed feedback, PTAL

@GeneDer (Contributor) commented Jan 20, 2024:

@alexeykudinkin I think those are already approved by us, right? Is there anything new that you want us to look at in particular? DCO is still failing because the email you used in that commit is alexey.kudinkin@gmail.com instead of ak@anyscale.com. I think you can amend that commit with the Anyscale email and force-push again, and then it should be resolved :)

alexeykudinkin and others added 25 commits January 19, 2024 22:41, including:

  • Rebased `set_status` calls to invoke `try_update_status` instead (for proper state-management)
  • …operly drained;
  • Missing changes
  • Clean up `ProxyWrapperCallStatus`
  • Added logging

(Commits signed off by Alexey Kudinkin <ak@anyscale.com>; one co-authored by Edward Oakes <ed.nmi.oakes@gmail.com>.)
@alexeykudinkin (Contributor Author):

Sorry, I missed that @edoakes had already approved it as well.

Thanks for reviewing it!

@edoakes edoakes merged commit e7ae6d0 into master Jan 22, 2024
2 checks passed
@edoakes edoakes deleted the ak/srv-drn-fix branch January 22, 2024 01:26
Successfully merging this pull request may close these issues: [Serve] Draining Proxy is shutdown prematurely