Skip to content

[core] cache find_gcs_addresses#61065

Open
rueian wants to merge 2 commits intoray-project:masterfrom
rueian:cache-find-gcs-address
Open

[core] cache find_gcs_addresses#61065
rueian wants to merge 2 commits intoray-project:masterfrom
rueian:cache-find-gcs-address

Conversation

@rueian
Copy link
Contributor

@rueian rueian commented Feb 14, 2026

Description

Executing the find_gcs_addresses function on a head node, where the RAY_ADDRESS env is not set, will scan the user's process list and find GCS addresses from each of the command line arguments in the list. The process is super expensive, as shown in the screenshot below. It dominates the detect function in Ray Data's hanging detector.

image

This PR caches the result of find_gcs_addresses if any.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces caching for the find_gcs_addresses function using @functools.cache. While this is a good performance optimization, it introduces the risk of returning stale data if GCS processes change during the lifetime of the client process. This could lead to connection issues in dynamic environments where clusters are started and stopped. My review comment details this concern and suggests potential mitigations.

@rueian rueian force-pushed the cache-find-gcs-address branch from d454587 to b111d45 Compare February 14, 2026 06:50
@rueian rueian changed the title [core] cache find_gcs_addresses [core] cache get_address_for_submission_client Feb 14, 2026
@rueian rueian force-pushed the cache-find-gcs-address branch from b111d45 to 3b58b5f Compare February 14, 2026 07:09
@rueian rueian added core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests labels Feb 14, 2026
@rueian rueian force-pushed the cache-find-gcs-address branch from 3b58b5f to e00ae0e Compare February 14, 2026 17:46
@github-actions
Copy link

github-actions bot commented Mar 1, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 1, 2026
@rueian rueian force-pushed the cache-find-gcs-address branch from e00ae0e to 5aa9f3f Compare March 12, 2026 00:13
@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Mar 12, 2026
@rueian rueian force-pushed the cache-find-gcs-address branch 7 times, most recently from fef7c0c to a31c8bb Compare March 14, 2026 17:37
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
@rueian rueian force-pushed the cache-find-gcs-address branch from a31c8bb to b6a6be1 Compare March 14, 2026 22:56
@rueian rueian changed the title [core] cache get_address_for_submission_client [core] cache find_gcs_addresses Mar 16, 2026
@rueian rueian marked this pull request as ready for review March 16, 2026 16:42
@rueian rueian requested a review from a team as a code owner March 16, 2026 16:42
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6a6be1592

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +373 to +375
@cache
def _cached_find_gcs_addresses():
return frozenset(_find_address_from_flag("--gcs-address"))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Invalidate cached GCS addresses after external Ray stop

Caching find_gcs_addresses() at module scope makes the first non-empty process scan persist for the lifetime of the Python process, but only a subset of call paths clear it. I checked callers like canonicalize_bootstrap_address_or_die used by python/ray/dashboard/utils.py and python/ray/_private/internal_api.py; those paths can run after a cluster is stopped/restarted externally (e.g., ray stop in another process), and they will continue to reuse the stale cached address instead of re-scanning, causing auto-discovery to target a dead GCS endpoint.

Useful? React with 👍 / 👎.

Comment on lines +383 to +386
addresses = _cached_find_gcs_addresses()
if not addresses:
_cached_find_gcs_addresses.cache_clear()
return set(addresses)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not atomic right? Between the call and the cache_clear, is it possible another thread could read the (now-stale empty) cached value?

Also, now there is a new set() from the cached frozenset on every call, which is intentional (to avoid callers mutating the cached data), but given this is on the hot path, it's worth confirming the set() copy cost is negligible compared to the process scan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not atomic right? Between the call and the cache_clear, is it possible another thread could read the (now-stale empty) cached value?

Correct. It is possible that another thread reads an empty cached value and clears the cache again. Upon checking, I think that currently there is no such concurrent thread that uses the function, but adding a thread lock might still be a good idea.

it's worth confirming the set() copy cost is negligible compared to the process scan.

A copy is currently necessary because the caller will alter the result. But we can also make the caller not alter the result.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7640716d26

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +387 to +390
with _find_gcs_addresses_lock:
addresses = _cached_find_gcs_addresses()
if not addresses:
_cached_find_gcs_addresses.cache_clear()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Add invalidation strategy for cached GCS address scans

Caching find_gcs_addresses() indefinitely means any non-empty process-scan result can outlive the actual cluster lifecycle, and only a few paths now clear it manually (worker.py/cluster_utils.py). Callers that do not clear this cache (for example scripts.start's preflight check in python/ray/scripts/scripts.py and canonicalize_bootstrap_address_or_die users) can keep using dead addresses after a local ray stop/restart in another process, causing false "already running" or failed auto-connect behavior until interpreter restart. Fresh evidence beyond the existing comment is that scripts.start still consults find_gcs_addresses() directly without any corresponding cache clear.

Useful? React with 👍 / 👎.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
@rueian rueian force-pushed the cache-find-gcs-address branch from 7640716 to 546a0de Compare March 16, 2026 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

2 participants