-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[xray] Monitor for Raylet processes #1831
[xray] Monitor for Raylet processes #1831
Conversation
@@ -1112,11 +1115,35 @@ def start_monitor(redis_address, node_ip_address, stdout_file=None, | |||
command.append("--autoscaling-config=" + str(autoscaling_config)) | |||
p = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file) | |||
if cleanup: | |||
all_processes[PROCESS_TYPE_WORKER].append(p) | |||
all_processes[PROCESS_TYPE_MONITOR].append(p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch!
# Location of the raylet executables. | ||
RAYLET_MONITOR_EXECUTABLE = os.path.join( | ||
os.path.abspath(os.path.dirname(__file__)), | ||
"core/src/ray/raylet/raylet_monitor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the executable also needs to be added to the ray_files = [
list in python/setup.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
}; | ||
return Subscribe(JobID::nil(), client_id_, notification_callback, | ||
subscription_callback); | ||
} | ||
|
||
Status ClientTable::Disconnect() { | ||
auto data = std::make_shared<ClientTableDataT>(local_client_); | ||
data->is_insertion = true; | ||
data->is_insertion = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this was a bug right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, apparently this path wasn't covered in the tests :( I can add a regression unit test.
src/ray/gcs/tables.h
Outdated
@@ -435,6 +435,10 @@ class ClientTable : private Log<UniqueID, ClientTableData> { | |||
/// \return Status | |||
ray::Status Disconnect(); | |||
|
|||
/// Mark a different client as disconnected. The client ID should never be | |||
/// reused for a new client. | |||
ray::Status MarkDisconnected(const ClientID &dead_client_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we include the \param
and \return
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I left some comments.
src/ray/raylet/monitor.h
Outdated
|
||
class Monitor { | ||
public: | ||
// Create a Raylet monitor attached to the given GCS address and port. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//
-> ///
for some places in this file, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, thanks!
|
||
namespace raylet { | ||
|
||
class Monitor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we include some documentation about the purpose of this class?
src/ray/raylet/monitor.h
Outdated
gcs::AsyncGcsClient gcs_client_; | ||
int64_t heartbeat_timeout_; | ||
boost::asio::deadline_timer heartbeat_timer_; | ||
std::unordered_map<ClientID, int64_t, UniqueIDHasher> heartbeats_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document the interpretation of the int64_t
?
src/ray/raylet/monitor.cc
Outdated
it->second--; | ||
if (it->second == 0) { | ||
if (dead_clients_.count(it->first) == 0) { | ||
RAY_LOG(WARNING) << "Client timed out: " << it->first.hex(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just it->first
(the hex()
happens automatically)
if use_raylet: | ||
start_raylet_monitor(redis_address, | ||
stdout_file=monitor_stdout_file, | ||
stderr_file=monitor_stderr_file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want these to go to the same log files as monitor.py
or to their own log file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to go to the same file as monitor.py
, since only one of the monitors should ever be active at a given time (except for the autoscaler part of monitor.py
).
Test FAILed. |
Test PASSed. |
This test failure looks potentially related.
|
Hmm, I'm sort of confused by this error because I don't think adding the new C++ monitor should mess with the original one (especially when the |
Ah, looks like the error was already reported in #1332. |
Test PASSed. |
What do these changes do?
This adds a C++ monitor, similar to
monitor.py
that listens for heartbeats from Raylet processes through the GCS. Raylet processes that do not respond within a timeout are marked as dead.