Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Revisiting ProxyState to fix draining sequence #41722

Merged
merged 29 commits into from
Jan 22, 2024
Merged
Changes from 1 commit
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
39c1447
Merge `start_new_ready_check` into `is_ready`
alexeykudinkin Dec 7, 2023
0231ccf
Cleaned up `try_update_status`;
alexeykudinkin Dec 7, 2023
0ebfe1b
Streamlined `is_healthy`, `is_ready` checks in `ProxyActorWrapper`
alexeykudinkin Dec 7, 2023
b6b3660
Tidying up
alexeykudinkin Dec 7, 2023
feffe80
Simplified `is_healthy`, `is_ready` checks even more
alexeykudinkin Dec 7, 2023
2144898
Fixed `is_drained` check to properly return whether proxy actor is pr…
alexeykudinkin Dec 7, 2023
d04d010
Streamlined proxy's state reconciliation flow
alexeykudinkin Dec 7, 2023
ab1cc40
Simplified readiness, health, drain checks;
alexeykudinkin Dec 8, 2023
5ba6068
Make sure exceptions are properly handled in proxy state reconciliation
alexeykudinkin Dec 8, 2023
d9ef34b
Cleaned up docs
alexeykudinkin Dec 8, 2023
e7c5e1e
Added test for ActorProxyWrapper
alexeykudinkin Dec 8, 2023
759d904
Fixed typos;
alexeykudinkin Dec 8, 2023
953343a
Fixed tests
alexeykudinkin Dec 8, 2023
73d881e
Fixed more tests
alexeykudinkin Dec 8, 2023
9e05832
Fixed some more tests
alexeykudinkin Dec 8, 2023
8d0cdcd
`lint`
alexeykudinkin Dec 8, 2023
136df1c
Fixing typo
alexeykudinkin Dec 8, 2023
189cad4
After rebase clean-up
alexeykudinkin Jan 10, 2024
c8dcb57
Tidying up
alexeykudinkin Jan 10, 2024
06b1755
Make proxy health-check configurable
alexeykudinkin Jan 11, 2024
f4283df
Unify `is_drained` API to require timeout be explicitly provided
alexeykudinkin Jan 11, 2024
b787188
Missing log statement
alexeykudinkin Jan 11, 2024
adfd2f3
Tidying up
alexeykudinkin Jan 18, 2024
eff6f23
Cleaned up exception handling
alexeykudinkin Jan 18, 2024
959feba
`_try_cancel_future` -> `_try_set_exception`
alexeykudinkin Jan 18, 2024
c5d2034
Make timeout_s optional
alexeykudinkin Jan 18, 2024
7002e11
Fixing typo
alexeykudinkin Jan 18, 2024
cc99b4a
Fixing another typo
alexeykudinkin Jan 18, 2024
8a1e954
Fixing tests
alexeykudinkin Jan 18, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Make sure exceptions are properly handled in proxy state reconciliation
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
  • Loading branch information
alexeykudinkin committed Jan 20, 2024
commit 5ba6068efb80335eaac7ed07c94ceb137fe6f2e6
65 changes: 24 additions & 41 deletions python/ray/serve/_private/proxy_state.py
Original file line number Diff line number Diff line change
Expand Up @@ -400,6 +400,16 @@ def update_actor_details(self, **kwargs) -> None:
self._actor_details = ProxyDetails(**details_kwargs)

def reconcile(self, draining: bool = False):
try:
self._reconcile_internal(draining)
except Exception:
self.try_update_status(ProxyStatus.UNHEALTHY)
logger.error(
"Unexpected error occurred when reconciling stae of "
f"proxy on node {self._node_id}:\n{traceback.format_exc()}"
)

def _reconcile_internal(self, draining: bool):
"""Update the status of the current proxy.

The state machine is:
Expand All @@ -409,26 +419,6 @@ def reconcile(self, draining: bool = False):

UNHEALTHY is a terminal state upon reaching which, Proxy is going to be
restarted by the controller

1) When the proxy is already shutting down, in DRAINED or UNHEALTHY status,
do nothing.
2) When the proxy is starting, check ready object reference. If ready
object reference returns a successful call set status to HEALTHY. If the
call to ready() on the proxy actor has any exception or timeout, increment
the consecutive health check failure counter and retry on the next update call.
The status is only set to UNHEALTHY when all retries have exhausted.
3) When the proxy already has an in-progress health check. If health check
object returns a successful call, keep the current status. If the call has
any exception or timeout, count towards 1 of the consecutive health check
failures and retry on the next update call. The status is only set to UNHEALTHY
when all retries have exhausted.
4) When the proxy need to setup another health check (when none of the
above met and the time since the last health check is longer than
PROXY_HEALTH_CHECK_PERIOD_S with some margin). Reset
self._last_health_check_time and set up a new health check object so the next
update can call healthy check again.
5) Transition the status between HEALTHY and DRAINING.
6) When the proxy is draining, check whether it's drained or not.
"""
if (
self._shutting_down
Expand All @@ -443,27 +433,20 @@ def reconcile(self, draining: bool = False):
) * PROXY_READY_CHECK_TIMEOUT_S

if self._status == ProxyStatus.STARTING:
try:
is_ready_response = self._actor_proxy_wrapper.is_ready(ready_check_timeout)
if is_ready_response is not None:
if is_ready_response:
self.try_update_status(ProxyStatus.HEALTHY)
self.update_actor_details(
worker_id=self._actor_proxy_wrapper.worker_id,
log_file_path=self._actor_proxy_wrapper.log_file_path,
status=self._status,
)
else:
self.try_update_status(ProxyStatus.UNHEALTHY)
logger.warning(
f"Proxy actor reported not ready on node {self._node_id}"
)
except Exception:
self.try_update_status(ProxyStatus.UNHEALTHY)
logger.warning(
"Unexpected error occurred when checking readiness of "
f"proxy on node {self._node_id}:\n{traceback.format_exc()}"
)
is_ready_response = self._actor_proxy_wrapper.is_ready(ready_check_timeout)
if is_ready_response is not None:
if is_ready_response:
self.try_update_status(ProxyStatus.HEALTHY)
self.update_actor_details(
worker_id=self._actor_proxy_wrapper.worker_id,
log_file_path=self._actor_proxy_wrapper.log_file_path,
status=self._status,
)
else:
self.try_update_status(ProxyStatus.UNHEALTHY)
logger.warning(
f"Proxy actor reported not ready on node {self._node_id}"
)
else:
# At this point, the proxy is either in HEALTHY or DRAINING status.
assert self._status in {ProxyStatus.HEALTHY, ProxyStatus.DRAINING}
Expand Down