Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces caching for the find_gcs_addresses function using @functools.cache. While this is a good performance optimization, it introduces the risk of returning stale data if GCS processes change during the lifetime of the client process. This could lead to connection issues in dynamic environments where clusters are started and stopped. My review comment details this concern and suggests potential mitigations.
d454587 to
b111d45
Compare
b111d45 to
3b58b5f
Compare
3b58b5f to
e00ae0e
Compare
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
e00ae0e to
5aa9f3f
Compare
fef7c0c to
a31c8bb
Compare
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
a31c8bb to
b6a6be1
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6a6be1592
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| @cache | ||
| def _cached_find_gcs_addresses(): | ||
| return frozenset(_find_address_from_flag("--gcs-address")) |
There was a problem hiding this comment.
Invalidate cached GCS addresses after external Ray stop
Caching find_gcs_addresses() at module scope makes the first non-empty process scan persist for the lifetime of the Python process, but only a subset of call paths clear it. I checked callers like canonicalize_bootstrap_address_or_die used by python/ray/dashboard/utils.py and python/ray/_private/internal_api.py; those paths can run after a cluster is stopped/restarted externally (e.g., ray stop in another process), and they will continue to reuse the stale cached address instead of re-scanning, causing auto-discovery to target a dead GCS endpoint.
Useful? React with 👍 / 👎.
python/ray/_private/services.py
Outdated
| addresses = _cached_find_gcs_addresses() | ||
| if not addresses: | ||
| _cached_find_gcs_addresses.cache_clear() | ||
| return set(addresses) |
There was a problem hiding this comment.
This is not atomic right? Between the call and the cache_clear, is it possible another thread could read the (now-stale empty) cached value?
Also, now there is a new set() from the cached frozenset on every call, which is intentional (to avoid callers mutating the cached data), but given this is on the hot path, it's worth confirming the set() copy cost is negligible compared to the process scan.
There was a problem hiding this comment.
This is not atomic right? Between the call and the cache_clear, is it possible another thread could read the (now-stale empty) cached value?
Correct. It is possible that another thread reads an empty cached value and clears the cache again. Upon checking, I think that currently there is no such concurrent thread that uses the function, but adding a thread lock might still be a good idea.
it's worth confirming the set() copy cost is negligible compared to the process scan.
A copy is currently necessary because the caller will alter the result. But we can also make the caller not alter the result.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7640716d26
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| with _find_gcs_addresses_lock: | ||
| addresses = _cached_find_gcs_addresses() | ||
| if not addresses: | ||
| _cached_find_gcs_addresses.cache_clear() |
There was a problem hiding this comment.
Add invalidation strategy for cached GCS address scans
Caching find_gcs_addresses() indefinitely means any non-empty process-scan result can outlive the actual cluster lifecycle, and only a few paths now clear it manually (worker.py/cluster_utils.py). Callers that do not clear this cache (for example scripts.start's preflight check in python/ray/scripts/scripts.py and canonicalize_bootstrap_address_or_die users) can keep using dead addresses after a local ray stop/restart in another process, causing false "already running" or failed auto-connect behavior until interpreter restart. Fresh evidence beyond the existing comment is that scripts.start still consults find_gcs_addresses() directly without any corresponding cache clear.
Useful? React with 👍 / 👎.
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
7640716 to
546a0de
Compare

Description
Executing the
find_gcs_addressesfunction on a head node, where the RAY_ADDRESS env is not set, will scan the user's process list and find GCS addresses from each of the command line arguments in the list. The process is super expensive, as shown in the screenshot below. It dominates thedetectfunction in Ray Data's hanging detector.This PR caches the result of
find_gcs_addressesif any.Related issues
Additional information