Revert raylet to worker GRPC communication back to asio #5450

Merged: 20 commits into ray-project:master on Aug 18, 2019

Conversation

pcmoritz
Contributor

@pcmoritz pcmoritz commented Aug 13, 2019

What do these changes do?

This reverts #5120 for the release, which caused problems with message reordering (#5411), raylet-worker heartbeats (#5343 and #5120), and other problems we tried to address in #5341, #5296, and #5313.

Ideally, we would keep protobuf for communicating between workers and raylets, but it is better to reset to a known working state and bring the changes back incrementally.

It deploys a workaround for the Java tests (https://github.com/ray-project/ray/pull/5450/files#diff-4eba54416056dabdaeb6d540848b3c62R31), which we can hopefully remove later. The reason was described by @raulchen:

I'm getting the following error on my machine with your PR. It seems that it's because sometimes the raylet won't gracefully exit and thus won't clean up the socket file (the Java tests use a fixed socket file, /tmp/ray/sockets/raylet).

libc++abi.dylib: terminating with uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >: bind: Address already in use
*** Aborted at 1565873121 (unix time) try "date -d @1565873121" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fff5e05f2c6) received by PID 63811 (TID 0x111f255c0) stack trace: ***
    @     0x7fff5e10fb5d _sigtramp
    @     0x7fff5b1b2998 GCC_except_table51
    @     0x7fff5dfc96a6 abort
    @     0x7fff5b1a5641 abort_message
    @     0x7fff5b1a57c7 default_terminate_handler()
    @     0x7fff5c758eeb _objc_terminate()
    @     0x7fff5b1b119e std::__terminate()
    @     0x7fff5b1b0f86 __cxxabiv1::failed_throw()
    @     0x7fff5b1a3f99 __cxa_throw
    @        0x108704e7b boost::throw_exception<>()
    @        0x108704d94 boost::asio::detail::do_throw_error()
    @        0x108704d23 boost::asio::detail::throw_error()
    @        0x108823f57 boost::asio::basic_socket_acceptor<>::basic_socket_acceptor()
    @        0x108800234 boost::asio::basic_socket_acceptor<>::basic_socket_acceptor()
    @        0x1087ffac9 ray::raylet::Raylet::Raylet()
    @        0x108800e0f ray::raylet::Raylet::Raylet()
    @        0x1086f651e main
    @     0x7fff5df243d5 start

And the reason the raylet won't exit gracefully is:

F0815 20:54:13.702641 275129792 node_manager.cc:418]  Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.

However, this error occurs randomly and is not specific to any particular test case. It seems that the raylet would sometimes get blocked.
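
For illustration only, a minimal C++/boost::asio sketch (this is not the workaround used in this PR, and the names are hypothetical) of how removing a stale socket file before binding avoids the "bind: Address already in use" abort when a previous raylet did not exit cleanly:

#include <boost/asio.hpp>
#include <cstdio>
#include <string>

int main() {
  boost::asio::io_service io_service;
  // The fixed socket path that the Java tests reuse across runs.
  const std::string socket_path = "/tmp/ray/sockets/raylet";
  // Remove any stale socket file left behind by a raylet that did not exit
  // cleanly, so the acceptor can bind again instead of aborting with
  // "bind: Address already in use".
  std::remove(socket_path.c_str());
  boost::asio::local::stream_protocol::endpoint endpoint(socket_path);
  boost::asio::local::stream_protocol::acceptor acceptor(io_service, endpoint);
  return 0;
}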

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

Collaborator

@edoakes edoakes left a comment

LGTM once all tests pass

@@ -39,9 +39,9 @@
private long client = 0;

// TODO(qwang): JobId parameter can be removed once we embed jobId in driverId.
public RayletClientImpl(String schedulerSockName, UniqueId workerId,
public RayletClientImpl(String schedulerSockName, UniqueId clientId,
Collaborator

I don't think we should revert the clientId->workerId change unless this causes problems

# initialization of grpc if we import pyarrow at first.
# NOTE(JoeyJiang): See https://github.com/ray-project/ray/issues/5219 for more
# details.
import ray._raylet
Collaborator

We should leave this in

Contributor Author

Hmmm, I prefer not to. It should be fixed properly once gRPC is brought back into the client. This kind of stuff is just a ticking time bomb.

@@ -171,7 +171,6 @@ int main(int argc, char *argv[]) {
server.reset();
gcs_client->Disconnect();
main_service.stop();
RAY_LOG(INFO) << "Raylet server received SIGTERM message, shutting down...";
Collaborator

shouldn't revert

BUILD.bazel Outdated
@@ -102,7 +102,7 @@ cc_proto_library(

# === Begin of rpc definitions ===

# gRPC common lib.
# GRPC common lib.
Collaborator

don't revert

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16269/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16272/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16277/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16283/
Test FAILed.

@zhijunfu zhijunfu requested review from zhijunfu and raulchen August 14, 2019 12:09
@zhijunfu
Contributor

Thanks for doing this.

It's probably very difficult to preserve all the gRPC changes behind a feature flag, but it sounds like it's simple to preserve the worker-to-raylet proto and rpc client/server files, so that it would be easier to integrate gRPC back for worker-to-raylet communication once the ordering issue is resolved.

@edoakes
Collaborator

edoakes commented Aug 14, 2019

I don't think we should keep those proto definitions/implementations in master if they aren't being used. It will be just as easy to add them back later from the old commits, since they're standalone files, and we should generally avoid dead code as much as possible.

Contributor

@raulchen raulchen left a comment

I'm afraid it's not a good idea to simply revert the whole PR, because:

  1. It'd be very hard to test and verify if gRPC can work in the future. We are likely to have to keep both asio and gRPC in the code base for the long term.
  2. Besides changing the communication lib, [gRPC] Migrate raylet client implementation to grpc #5120 also includes some nice optimizations, e.g., consolidating duplicated message definitions, changing worker lookup to use worker_id, and fixing the SWAP queue hack (see my in-line comments).

I'm still concerned that doing the revert could waste a lot of time. Fixing the issue should be much easier. For example,

  1. Make FetchOrReconstruct sync (FetchOrReconstruct is only called when the worker is blocked waiting on objects, so making it sync won't actually hurt perf).
  2. Make the requests idempotent as @ericl suggested (but I don't fully understand this solution yet. @ericl could you please elaborate? thanks)
  3. Use [WIP] Handle request order for gRPC #5453.

Any of these 3 fixes looks easier than reverting.

However, if you insist on reverting, could you please only change the underlying communication lib, and keep other things (e.g., message definitions in protobuf)? Thanks in advance.

@@ -65,20 +65,6 @@ cc_proto_library(
deps = [":object_manager_proto"],
)

proto_library(
name = "raylet_proto",
Contributor

Can we still define the messages in protobuf (but use asio for communication)? Some message types are shared by other modules as well, and this would also let us get rid of the flatbuffers dependency.
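
As a rough sketch of this suggestion (the message type and connection API below are stand-ins, not the exact Ray interfaces): the protobuf definitions stay, the message is serialized to bytes, and the existing asio type-plus-payload framing carries it.

#include <cstdint>
#include <iostream>
#include <string>

// Stand-in for a protobuf-generated message; real code would use the
// generated class and its SerializeToString()/ParseFromString() methods.
struct RegisterClientRequest {
  std::string worker_id;
  std::string SerializeToString() const { return worker_id; }
};

// Stand-in for the asio connection's framed write: a message type tag plus an
// opaque payload, as the existing WriteMessage path does.
void WriteMessage(int64_t message_type, const std::string &payload) {
  std::cout << "type=" << message_type << " payload bytes=" << payload.size() << std::endl;
}

int main() {
  RegisterClientRequest request{"worker-1"};
  // Keep the protobuf message definition; only the transport stays asio.
  WriteMessage(/*MessageType::RegisterClientRequest=*/1, request.SerializeToString());
  return 0;
}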

repeated ResourceIdSetInfo resource_ids = 2;
// TODO(zhijunfu): `resource_ids` is represented as
// flatbutters-serialized bytes, will be moved to protobuf later.
bytes resource_ids = 2;
Contributor

we should keep ResourceIdSetInfo definition in protobuf


if (is_worker) {
void NodeManager::ProcessRegisterClientRequestMessage(
const std::shared_ptr<LocalClientConnection> &client, const uint8_t *message_data) {
Contributor

Could you please keep the code structure, e.g. HandleFooRequest? It would make it easier to test and to bring back gRPC. You can just parse the requests from asio and dispatch them to these functions.
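
A minimal, self-contained sketch of that structure (all names here are hypothetical, not the actual NodeManager API): one dispatch routine parses the framed asio message and forwards it to a gRPC-style HandleFooRequest function, so the handler could later back a gRPC service unchanged.

#include <cstdint>
#include <iostream>
#include <string>

enum class MessageType : int64_t { RegisterClientRequest = 1, WaitRequest = 2 };

// Stand-in for the parsed protobuf request.
struct RegisterClientRequest {
  std::string worker_id;
};

// gRPC-style handler kept as its own function so it can be reused when the
// transport switches back to gRPC.
void HandleRegisterClientRequest(const RegisterClientRequest &request) {
  std::cout << "Registering worker " << request.worker_id << std::endl;
}

// Called from the asio read loop with the framed message type and payload.
void ProcessClientMessage(MessageType type, const std::string &payload) {
  switch (type) {
    case MessageType::RegisterClientRequest: {
      // Real code would ParseFromArray/ParseFromString the protobuf payload.
      RegisterClientRequest request{payload};
      HandleRegisterClientRequest(request);
      break;
    }
    default:
      std::cerr << "Unknown message type" << std::endl;
      break;
  }
}

int main() {
  ProcessClientMessage(MessageType::RegisterClientRequest, "worker-1");
  return 0;
}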

if (use_push_task) {
// only call `HandleWorkerAvailable` when push mode is used.
HandleWorkerAvailable(worker_id);
HandleWorkerAvailable(connection);
Contributor

I think this optimization is still useful. Using the connection as the key, looking up a worker takes O(n); using worker_id, it's O(1).
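
A small illustration of the complexity difference (the types are simplified stand-ins, not the actual WorkerPool): keying registered workers by worker_id gives an O(1) hash lookup, while finding a worker by its connection needs an O(n) scan.

#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Worker {
  std::string worker_id;
  int connection_fd;  // stand-in for the asio connection
};

class WorkerPool {
 public:
  void Register(std::shared_ptr<Worker> worker) {
    workers_by_id_[worker->worker_id] = worker;
    workers_.push_back(worker);
  }

  // O(1): hash lookup keyed by worker_id.
  std::shared_ptr<Worker> GetRegisteredWorker(const std::string &worker_id) const {
    auto it = workers_by_id_.find(worker_id);
    return it == workers_by_id_.end() ? nullptr : it->second;
  }

  // O(n): linear scan to find the worker that owns a given connection.
  std::shared_ptr<Worker> GetRegisteredWorker(int connection_fd) const {
    for (const auto &worker : workers_) {
      if (worker->connection_fd == connection_fd) return worker;
    }
    return nullptr;
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<Worker>> workers_by_id_;
  std::vector<std::shared_ptr<Worker>> workers_;
};

int main() {
  WorkerPool pool;
  pool.Register(std::make_shared<Worker>(Worker{"worker-1", 3}));
  return pool.GetRegisteredWorker("worker-1") ? 0 : 1;
}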

auto finish_assign_task_callback = [this, worker, task_id](Status status) {
if (worker->UsePush()) {
// NOTE: we cannot directly call `FinishAssignTask` here because
// it assumes the task is in SWAP queue, thus we need to delay invoking this
Contributor

This SWAP queue was actually a hack for asio that had existed in the code base for a long time. #5120 fixed this hack. It's unfortunate that we're bringing it back.

Contributor

It is not actually true that the SWAP queue is needed for asio. It was just one solution for when communication with the worker is asynchronous, which is true in both asio and grpc.

I do agree that the pattern in #5120 was an improvement over this. We could try and keep that here, by just moving tasks directly to the RUNNING queue in DispatchTasks and calling the code in FinishAssignTask directly here (vs calling it in a callback).

@@ -0,0 +1,392 @@
#include "raylet_client.h"
Contributor

It'd be better to keep RayletClient in the rpc directory, because

  1. Conceptually, RayletClient is used for rpc.
  2. raylet_client can be an independent lib, so the worker can depend only on it instead of on the whole raylet_lib.

/// The `ClientCallManager` object that is shared by `WorkerTaskClient` from all
/// workers.
rpc::ClientCallManager &client_call_manager_;
/// Indicates whether this is a worker or a driver.
bool is_worker_;
Contributor

This is still useful

}

std::shared_ptr<Worker> WorkerPool::GetRegisteredWorker(const WorkerID &worker_id) const {
std::shared_ptr<Worker> WorkerPool::GetRegisteredWorker(
Contributor

better to use worker_id as the key for looking up workers

@pcmoritz pcmoritz dismissed raulchen’s stale review August 16, 2019 04:00

We concluded we should do a full revert to make a timely release possible and will bring back the changes incrementally. We will fix gRPC and do more testing to make sure it respects all the invariants of worker<->raylet communication (and relax these as we simplify the raylet).

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16315/
Test FAILed.

@robertnishihara
Collaborator

@pcmoritz Please explain what the issue with the Java test is and what the workaround is. It's not at all clear from https://github.com/ray-project/ray/pull/5450/files#diff-4eba54416056dabdaeb6d540848b3c62R31.

@pcmoritz
Contributor Author

pcmoritz commented Aug 16, 2019

According to Hao:

I'm getting the following error on my machine with your PR. It seems that it's because sometimes the raylet won't gracefully exit and thus won't clean up the socket file (the Java tests use a fixed socket file, /tmp/ray/sockets/raylet).

libc++abi.dylib: terminating with uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >: bind: Address already in use
*** Aborted at 1565873121 (unix time) try "date -d @1565873121" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fff5e05f2c6) received by PID 63811 (TID 0x111f255c0) stack trace: ***
    @     0x7fff5e10fb5d _sigtramp
    @     0x7fff5b1b2998 GCC_except_table51
    @     0x7fff5dfc96a6 abort
    @     0x7fff5b1a5641 abort_message
    @     0x7fff5b1a57c7 default_terminate_handler()
    @     0x7fff5c758eeb _objc_terminate()
    @     0x7fff5b1b119e std::__terminate()
    @     0x7fff5b1b0f86 __cxxabiv1::failed_throw()
    @     0x7fff5b1a3f99 __cxa_throw
    @        0x108704e7b boost::throw_exception<>()
    @        0x108704d94 boost::asio::detail::do_throw_error()
    @        0x108704d23 boost::asio::detail::throw_error()
    @        0x108823f57 boost::asio::basic_socket_acceptor<>::basic_socket_acceptor()
    @        0x108800234 boost::asio::basic_socket_acceptor<>::basic_socket_acceptor()
    @        0x1087ffac9 ray::raylet::Raylet::Raylet()
    @        0x108800e0f ray::raylet::Raylet::Raylet()
    @        0x1086f651e main
    @     0x7fff5df243d5 start

@pcmoritz
Contributor Author

Unfortunately, I can't reproduce it. The workaround is there to make sure the Java tests are not broken. It should be fixed properly after the release.

fbb.Finish(wait_reply);

auto status =
client->WriteMessage(static_cast<int64_t>(protocol::MessageType::WaitReply),
Contributor

I guess it wasn't like this before, but while we're at it, do you want to make this a WriteMessageAsync?

auto finish_assign_task_callback = [this, worker, task_id](Status status) {
if (worker->UsePush()) {
// NOTE: we cannot directly call `FinishAssignTask` here because
// it assumes the task is in SWAP queue, thus we need to delay invoking this
Contributor

It is not actually true that the SWAP queue is needed for asio. It was just one solution for when communication with the worker is asynchronous, which is true in both asio and grpc.

I do agree that the pattern in #5120 was an improvement over this. We could try and keep that here, by just moving tasks directly to the RUNNING queue in DispatchTasks and calling the code in FinishAssignTask directly here (vs calling it in a callback).

@pcmoritz
Contributor Author

I agree these are good changes; I think we can bring them back together with the structure of the gRPC handlers (which will also require converting the serialization back to protobuf). It's a bit more work, so let's do it in a follow-up PR.

@raulchen
Contributor

I just merged #5370. There are some small conflicts that should be easy to fix. That PR added a return status for RegisterClient to indicate whether worker registration was successful. As asio doesn't have a status, you can add a bool successful flag in RegisterClientReply.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16321/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16322/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16329/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16325/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16338/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16339/
Test FAILed.

@pcmoritz pcmoritz added the release-blocker P0 Issue that blocks the release label Aug 17, 2019
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16356/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16357/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16368/
Test PASSed.

@pcmoritz pcmoritz merged commit 599cc2b into ray-project:master Aug 18, 2019
@pcmoritz pcmoritz deleted the revert-raylet-worker-grpc branch August 18, 2019 02:11