[GCS-Ray] update doc and error message for GCS-Ray #22528

Merged 1 commit on Feb 23, 2022

@mwtian mwtian commented Feb 21, 2022

Why are these changes needed?

Update documentation to reflect that Ray no longer starts Redis by default.

Improve the error messages shown during Ray bootstrapping on head and worker nodes when GCS is unavailable at the given address. Currently the error messages are:

2022-02-21 04:57:29,895	ERROR utils.py:1208 -- Internal KV Get failed
Traceback (most recent call last):
  File "/home/ubuntu/ray/python/ray/_private/utils.py", line 1206, in internal_kv_get_with_retry
    result = gcs_client.internal_kv_get(key, namespace)
  File "/home/ubuntu/ray/python/ray/_private/gcs_utils.py", line 146, in wrapper
    return f(self, *args, **kwargs)
  File "/home/ubuntu/ray/python/ray/_private/gcs_utils.py", line 230, in internal_kv_get
    reply = self._kv_stub.InternalKVGet(req)
  File "/home/ubuntu/anaconda3/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ubuntu/anaconda3/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1645419449.895707218","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4142,"referenced_errors":[{"created":"@1645419449.895701231","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"
>

and:

2022-02-21 05:36:25,442	ERROR utils.py:1234 -- Internal KV Put failed
Traceback (most recent call last):
  File "/home/ubuntu/ray/python/ray/_private/utils.py", line 1230, in internal_kv_put_with_retry
    return gcs_client.internal_kv_put(
  File "/home/ubuntu/ray/python/ray/_private/gcs_utils.py", line 146, in wrapper
    return f(self, *args, **kwargs)
  File "/home/ubuntu/ray/python/ray/_private/gcs_utils.py", line 249, in internal_kv_put
    reply = self._kv_stub.InternalKVPut(req)
  File "/home/ubuntu/anaconda3/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ubuntu/anaconda3/envs/ray/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1645421785.442421870","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4142,"referenced_errors":[{"created":"@1645421785.442418121","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"
>

Neither traceback is very helpful to users for troubleshooting. The updated message is:

2022-02-20 21:53:52,369	WARNING utils.py:1213 -- Unable to connect to GCS at 127.0.0.1:8080. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
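
For readers who want to see the shape of the change, here is a minimal sketch of a retry helper that replaces a raw traceback with an actionable warning. It is not the code in this PR: `GcsUnavailableError` and `call_with_retry` are hypothetical names standing in for the gRPC UNAVAILABLE error and the retry wrappers in `utils.py`, and the retry count and delay are assumptions.

```python
import logging
import time

logger = logging.getLogger(__name__)


class GcsUnavailableError(Exception):
    """Hypothetical stand-in for a gRPC StatusCode.UNAVAILABLE failure."""


def call_with_retry(fn, address, num_retries=3, delay_s=0.0):
    """Call fn(), logging an actionable warning each time GCS is unreachable.

    This is an illustrative sketch, not Ray's actual implementation.
    """
    for attempt in range(1, num_retries + 1):
        try:
            return fn()
        except GcsUnavailableError:
            # Log a short, actionable message instead of the full traceback.
            logger.warning(
                "Unable to connect to GCS at %s. Check that (1) Ray GCS with "
                "matching version started successfully at the specified "
                "address, and (2) there is no firewall setting preventing "
                "access. (attempt %d/%d)",
                address,
                attempt,
                num_retries,
            )
            time.sleep(delay_s)
    raise RuntimeError(
        f"Could not reach GCS at {address} after {num_retries} attempts."
    )
```

The point of the design is that the caller sees the address and two concrete things to check, rather than gRPC internals.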

This change should have been sent out when GCS-Ray was enabled. I'm asking for it to be cherry-picked into the releases/1.11.0 branch.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@mwtian mwtian added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Feb 21, 2022
@@ -327,7 +325,7 @@ you can use a tool such as ``nmap`` or ``nc``.
Host is up, received echo-reply ttl 60 (0.00087s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
-6379/tcp open redis syn-ack ttl 60 Redis key-value store
+6379/tcp open redis? syn-ack
Contributor

Not sure about this one, @ericl is this correct?

Member Author

This is what I got from running the command. I believe nmap guessed redis as the service because of port 6379, but since the port no longer speaks the Redis protocol, the guess is suffixed with ?.
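
For anyone who wants the same reachability check without nmap or nc, a minimal sketch in Python follows. `is_port_open` is a hypothetical helper, not part of Ray, and like `nc -z` it only confirms that something accepts TCP connections on the port. It says nothing about which protocol is spoken there.

```python
import socket


def is_port_open(host, port, timeout_s=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `is_port_open("127.0.0.1", 6379)` on the head node would be one way to check the address before starting a worker, assuming 6379 is the port Ray was started with.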

Contributor

@fishbone fishbone left a comment

LGTM. Thanks for updating the docs.

@ericl ericl merged commit 9a157df into ray-project:master Feb 23, 2022
fishbone pushed a commit that referenced this pull request Feb 23, 2022
Update documentation to reflect that Ray no longer starts Redis by default.
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022
Update documentation to reflect that Ray no longer starts Redis by default.