-
Couldn't load subscription status.
- Fork 6.8k
Suppress errors when worker or driver intentionally disconnects. #2935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suppress errors when worker or driver intentionally disconnects. #2935
Conversation
|
Test PASSed. |
|
Test FAILed. |
|
OK some progress has been made here - but now the following error message doesn't show when unintentionally killing worker.
|
|
src/common/io.cc
Outdated
| /* Handle the case in which the socket is closed. */ | ||
| // RAY_LOG(INFO) << "Socket Disconnected in read_message " | ||
| // << static_cast<int64_t>(MessageType::DisconnectClient); | ||
| RAY_LOG(INFO) << "Socket Disconnected in read_message " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this codepath isnt hit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this is a legacy Ray thing which is replaced by client_connection.cc I believe
| << static_cast<int64_t>(MessageType::IntentionalDisconnectClient); | ||
| // RAY_LOG(INFO) << "[LocalSchedClient] Disconnecting Client " | ||
| // << static_cast<int64_t>(CommonMessageType::DISCONNECT_CLIENT); | ||
| write_message(conn->conn, static_cast<int64_t>(MessageType::IntentionalDisconnectClient), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for some reason, introducing this redirection causes A worker died or was killed while executing task 000000005ea57cce1217c0fddb6bd1fb3fb879f6. to no longer show, even when non-intentionally disconnecting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well there this is handled by a different event handler, so the behavior will be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated above comment; but actually, this is fine.
I was trying to kill -9 an idle actor and obtain the task killed stderr response, but nothing was occurring. kill -9 on a actor running a task will be fine. Shouldn't raise warnings at equal levels?
|
Test FAILed. |
|
Test FAILed. |
|
Test FAILed. |
| const std::shared_ptr<LocalClientConnection> &client) { | ||
| RAY_LOG(INFO) << "ProcessDisconnectClientMessage Started!"; | ||
| // Remove the dead worker from the pool and stop listening for messages. | ||
| const std::shared_ptr<Worker> worker = worker_pool_.GetRegisteredWorker(client); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
|
Test FAILed. |
| // This is sent from the local scheduler to a worker or driver. | ||
| RegisterClientReply, | ||
| // Notify the local scheduler that this client is disconnecting gracefully. | ||
| // Notify the local scheduler that this client is disconnecting unexpectedly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is disconnecting -> has disconnected
| << static_cast<int64_t>(MessageType::IntentionalDisconnectClient); | ||
| // RAY_LOG(INFO) << "[LocalSchedClient] Disconnecting Client " | ||
| // << static_cast<int64_t>(CommonMessageType::DISCONNECT_CLIENT); | ||
| write_message(conn->conn, static_cast<int64_t>(MessageType::IntentionalDisconnectClient), |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
| write_message(conn->conn, static_cast<int64_t>(MessageType::DisconnectClient), | ||
| fbb.GetSize(), fbb.GetBufferPointer(), &conn->write_mutex); | ||
| if(conn->use_raylet) { | ||
| write_message(conn->conn, static_cast<int64_t>(MessageType::IntentionalDisconnectClient), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Moving conversation here): I was trying to kill -9 an idle actor and obtain the task killed stderr response, but nothing was occurring. kill -9 on a actor running a task will be fine. Shouldn't raise warnings at equal levels?
|
Test PASSed. |
|
Test FAILed. |
|
Test FAILed. |
src/ray/raylet/node_manager.cc
Outdated
| } break; | ||
| case protocol::MessageType::DisconnectClient: { | ||
| case protocol::MessageType::DisconnectClient: | ||
| case protocol::MessageType::IntentionalDisconnectClient: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line is an accident, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to continue processing messages even when intentionally disconnecting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh my mistake, I thought case protocol::MessageType::IntentionalDisconnectClient: appeared twice, but it actually appears once in ProcessClientMessage and once in ProcessNodeManagerMessage.
The "intentional disconnect" message should never be received from another node manager (only from workers/drivers), so this case here should not exist.
|
Test PASSed. |
src/ray/raylet/node_manager.cc
Outdated
| // all the messages are ignored except DisconnectClient. | ||
| if (static_cast<protocol::MessageType>(message_type) != | ||
| protocol::MessageType::DisconnectClient) { | ||
| if ((static_cast<protocol::MessageType>(message_type) != |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor, but static_cast<protocol::MessageType>(message_type) comes up in 3 places in this function I think. Maybe we can do it once earlier on?
src/ray/raylet/node_manager.cc
Outdated
| } break; | ||
| case protocol::MessageType::IntentionalDisconnectClient: { | ||
| if (registered_worker) { | ||
| registered_worker->MarkDead(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be removed. Do we need it?
MarkDead I think is only used when a driver dies to essentially ignore messages from workers that were executing tasks for that driver.
However, by returning in this method and not calling client->ProcessMessages();`, I think that will happen automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the point of this is to avoid the warning message here:
ray/src/ray/raylet/node_manager.cc
Lines 678 to 685 in 9c14050
| // TODO(rkn): Define this constant somewhere else. | |
| std::string type = "worker_died"; | |
| std::ostringstream error_message; | |
| error_message << "A worker died or was killed while executing task " << task_id | |
| << "."; | |
| RAY_CHECK_OK(gcs_client_->error_table().PushErrorToDriver( | |
| job_id, type, error_message.str(), current_time_ms())); | |
that is raised in this code path (but avoided if we set it as dead):
ray/src/ray/raylet/node_manager.cc
Lines 662 to 664 in 9c14050
| const TaskID &task_id = worker->GetAssignedTaskId(); | |
| if (!task_id.is_nil() && !worker->IsDead()) { | |
| // If the worker was killed intentionally, e.g., when the driver that created |
Is there a better way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh hm, can we just add a bool push_warning argument to ProcessDisconnectClientMessage?
|
@richardliaw One thing you mentioned is that we should also push an error when a worker/actor dies that wasn't executing a task. Do you want to do that in this PR or a subsequent one? |
|
Test PASSed. |
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
|
Hi @richardliaw @robertnishihara. I'd like to use this But I'm still confused about how this works. In raylet mode, it seems that the worker always send Thanks. |
|
@raulchen
ray/src/ray/common/client_connection.cc Lines 151 to 154 in b82fd15
|
|
@robertnishihara thanks! |
What do these changes do?
TODO:
intentionaldisconnectRelated issue number
Addresses #2481