Skip to content

Conversation

@richardliaw
Copy link
Contributor

@richardliaw richardliaw commented Sep 24, 2018

What do these changes do?

TODO:

  • Have a separate path for explicit disconnects - such as ray_terminate and out of scope.
  • Make sure proper handling of intentionaldisconnect
  • Tests

Related issue number

Addresses #2481

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8336/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8349/
Test FAILed.

@richardliaw
Copy link
Contributor Author

OK some progress has been made here - but now the following error message doesn't show when unintentionally killing worker.

A worker died or was killed while executing task 00000000fbf609bc5b5a9d2fb31bf946fac35a66.

@richardliaw
Copy link
Contributor Author


In [2]: the_pid
Out[2]: 4845

In [3]: actor = Counter.remote()

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0925 16:53:06.338121 2316759936 local_scheduler_client.cc:59] Intentionally disconnecting client 7
In [4]: actor = Counter.remote()

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0925 16:53:12.219475 2316759936 local_scheduler_client.cc:59] Intentionally disconnecting client 7
In [5]: actor = Counter.remote()

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0925 16:53:12.929061 2316759936 local_scheduler_client.cc:59] Intentionally disconnecting client 7
In [6]: actor = Counter.remote()

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0925 16:53:13.493301 2316759936 local_scheduler_client.cc:59] Intentionally disconnecting client 7
In [7]: actor = Counter.remote()

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0925 16:53:14.239035 2316759936 local_scheduler_client.cc:59] Intentionally disconnecting client 7
In [8]: actor = Counter.remote()

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0925 16:53:14.954322 2316759936 local_scheduler_client.cc:59] Intentionally disconnecting client 7
In [9]: actor.__ray_terminate__.remote()
Out[9]: ObjectID(010000003eaa034aa23f3c4e138bf78c46d2d061)

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0925 16:53:25.543941 2316759936 local_scheduler_client.cc:59] Intentionally disconnecting client 7
In [10]: actor = Counter.remote()

In [11]: ray.get(actor.pid.remote())
Out[11]: 4851

In [12]: I0925 16:53:57.202100 2316759936 client_connection.cc:133] Error handling happening...6
I0925 16:53:57.205284 2316759936 node_manager.cc:649] ProcessDisconnectClientMessage Started!
In [12]: actor = Counter.remote()

In [13]: ray.get(actor.pid.remote())
Out[13]: 4843

In [14]: actor = Counter.remote()

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0925 16:54:11.261371 2316759936 local_scheduler_client.cc:59] Intentionally disconnecting client 7
In [15]: exit

src/common/io.cc Outdated
/* Handle the case in which the socket is closed. */
// RAY_LOG(INFO) << "Socket Disconnected in read_message "
// << static_cast<int64_t>(MessageType::DisconnectClient);
RAY_LOG(INFO) << "Socket Disconnected in read_message "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this codepath isnt hit

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is a legacy Ray thing which is replaced by client_connection.cc I believe

<< static_cast<int64_t>(MessageType::IntentionalDisconnectClient);
// RAY_LOG(INFO) << "[LocalSchedClient] Disconnecting Client "
// << static_cast<int64_t>(CommonMessageType::DISCONNECT_CLIENT);
write_message(conn->conn, static_cast<int64_t>(MessageType::IntentionalDisconnectClient),
Copy link
Contributor Author

@richardliaw richardliaw Sep 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for some reason, introducing this redirection causes A worker died or was killed while executing task 000000005ea57cce1217c0fddb6bd1fb3fb879f6. to no longer show, even when non-intentionally disconnecting?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well there this is handled by a different event handler, so the behavior will be different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated above comment; but actually, this is fine.

I was trying to kill -9 an idle actor and obtain the task killed stderr response, but nothing was occurring. kill -9 on a actor running a task will be fine. Shouldn't raise warnings at equal levels?

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8350/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8352/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8353/
Test FAILed.

const std::shared_ptr<LocalClientConnection> &client) {
RAY_LOG(INFO) << "ProcessDisconnectClientMessage Started!";
// Remove the dead worker from the pool and stop listening for messages.
const std::shared_ptr<Worker> worker = worker_pool_.GetRegisteredWorker(client);

This comment was marked as resolved.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8378/
Test FAILed.

// This is sent from the local scheduler to a worker or driver.
RegisterClientReply,
// Notify the local scheduler that this client is disconnecting gracefully.
// Notify the local scheduler that this client is disconnecting unexpectedly.
Copy link
Collaborator

@robertnishihara robertnishihara Sep 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is disconnecting -> has disconnected

<< static_cast<int64_t>(MessageType::IntentionalDisconnectClient);
// RAY_LOG(INFO) << "[LocalSchedClient] Disconnecting Client "
// << static_cast<int64_t>(CommonMessageType::DISCONNECT_CLIENT);
write_message(conn->conn, static_cast<int64_t>(MessageType::IntentionalDisconnectClient),

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

@richardliaw richardliaw changed the title [wip] Separate intentional disconnects in backend Separate intentional disconnects in backend Sep 30, 2018
write_message(conn->conn, static_cast<int64_t>(MessageType::DisconnectClient),
fbb.GetSize(), fbb.GetBufferPointer(), &conn->write_mutex);
if(conn->use_raylet) {
write_message(conn->conn, static_cast<int64_t>(MessageType::IntentionalDisconnectClient),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Moving conversation here): I was trying to kill -9 an idle actor and obtain the task killed stderr response, but nothing was occurring. kill -9 on a actor running a task will be fine. Shouldn't raise warnings at equal levels?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8480/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8481/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8493/
Test FAILed.

} break;
case protocol::MessageType::DisconnectClient: {
case protocol::MessageType::DisconnectClient:
case protocol::MessageType::IntentionalDisconnectClient: {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is an accident, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to continue processing messages even when intentionally disconnecting?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my mistake, I thought case protocol::MessageType::IntentionalDisconnectClient: appeared twice, but it actually appears once in ProcessClientMessage and once in ProcessNodeManagerMessage.

The "intentional disconnect" message should never be received from another node manager (only from workers/drivers), so this case here should not exist.

@robertnishihara
Copy link
Collaborator

Nice! This looks pretty good.

This PR interacts with #2997. I'd suggest merging #2997 first since it adds some relevant tests and also changes the semantics of ProcessDisconnectClientMessage a little.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8501/
Test PASSed.

// all the messages are ignored except DisconnectClient.
if (static_cast<protocol::MessageType>(message_type) !=
protocol::MessageType::DisconnectClient) {
if ((static_cast<protocol::MessageType>(message_type) !=
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, but static_cast<protocol::MessageType>(message_type) comes up in 3 places in this function I think. Maybe we can do it once earlier on?

} break;
case protocol::MessageType::IntentionalDisconnectClient: {
if (registered_worker) {
registered_worker->MarkDead();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be removed. Do we need it?

MarkDead I think is only used when a driver dies to essentially ignore messages from workers that were executing tasks for that driver.

However, by returning in this method and not calling client->ProcessMessages();`, I think that will happen automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the point of this is to avoid the warning message here:

// TODO(rkn): Define this constant somewhere else.
std::string type = "worker_died";
std::ostringstream error_message;
error_message << "A worker died or was killed while executing task " << task_id
<< ".";
RAY_CHECK_OK(gcs_client_->error_table().PushErrorToDriver(
job_id, type, error_message.str(), current_time_ms()));

that is raised in this code path (but avoided if we set it as dead):

const TaskID &task_id = worker->GetAssignedTaskId();
if (!task_id.is_nil() && !worker->IsDead()) {
// If the worker was killed intentionally, e.g., when the driver that created

Is there a better way?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh hm, can we just add a bool push_warning argument to ProcessDisconnectClientMessage?

@robertnishihara
Copy link
Collaborator

@richardliaw One thing you mentioned is that we should also push an error when a worker/actor dies that wasn't executing a task. Do you want to do that in this PR or a subsequent one?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8511/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8523/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8533/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8542/
Test PASSed.

@robertnishihara robertnishihara changed the title Separate intentional disconnects in backend Suppress errors when worker or driver intentionally disconnects. Oct 4, 2018
@robertnishihara robertnishihara merged commit 01bb073 into ray-project:master Oct 4, 2018
@robertnishihara robertnishihara deleted the fix_extra_error branch October 4, 2018 07:06
@raulchen
Copy link
Contributor

Hi @richardliaw @robertnishihara. I'd like to use this IntentionalDisconnect message to decide whether an actor should be reconstructed.

But I'm still confused about how this works. In raylet mode, it seems that the worker always send IntentionalDisconnect to backend. When will the worker send Disconnect? and where's the code?

Thanks.

@robertnishihara
Copy link
Collaborator

@raulchen IntentionalDisconnect is only sent when you manually call local_scheduler_client.disconnect() from Python, which only happens in the __ray_terminate__ method in actors.

Disconnect happens every time a read fails on a socket to a client. This is never sent manually, it is just done under the hood. It appears in several places in https://github.com/ray-project/ray/blob/master/src/ray/common/client_connection.cc, e.g.,

void ClientConnection<T>::ProcessMessage(const boost::system::error_code &error) {
if (error) {
read_type_ = static_cast<int64_t>(protocol::MessageType::DisconnectClient);
}

@raulchen
Copy link
Contributor

@robertnishihara thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants