Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xray] Monitor for Raylet processes #1831

Merged
merged 6 commits into from
Apr 6, 2018

Conversation

stephanie-wang
Copy link
Contributor

What do these changes do?

This adds a C++ monitor, similar to monitor.py that listens for heartbeats from Raylet processes through the GCS. Raylet processes that do not respond within a timeout are marked as dead.

@@ -1112,11 +1115,35 @@ def start_monitor(redis_address, node_ip_address, stdout_file=None,
command.append("--autoscaling-config=" + str(autoscaling_config))
p = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)
if cleanup:
all_processes[PROCESS_TYPE_WORKER].append(p)
all_processes[PROCESS_TYPE_MONITOR].append(p)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

# Location of the raylet executables.
RAYLET_MONITOR_EXECUTABLE = os.path.join(
os.path.abspath(os.path.dirname(__file__)),
"core/src/ray/raylet/raylet_monitor")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the executable also needs to be added to the ray_files = [ list in python/setup.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

};
return Subscribe(JobID::nil(), client_id_, notification_callback,
subscription_callback);
}

Status ClientTable::Disconnect() {
auto data = std::make_shared<ClientTableDataT>(local_client_);
data->is_insertion = true;
data->is_insertion = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this was a bug right?

Copy link
Contributor Author

@stephanie-wang stephanie-wang Apr 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, apparently this path wasn't covered in the tests :( I can add a regression unit test.

@@ -435,6 +435,10 @@ class ClientTable : private Log<UniqueID, ClientTableData> {
/// \return Status
ray::Status Disconnect();

/// Mark a different client as disconnected. The client ID should never be
/// reused for a new client.
ray::Status MarkDisconnected(const ClientID &dead_client_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include the \param and \return?

Copy link
Collaborator

@robertnishihara robertnishihara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I left some comments.


class Monitor {
public:
// Create a Raylet monitor attached to the given GCS address and port.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// -> /// for some places in this file, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, thanks!


namespace raylet {

class Monitor {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we include some documentation about the purpose of this class?

gcs::AsyncGcsClient gcs_client_;
int64_t heartbeat_timeout_;
boost::asio::deadline_timer heartbeat_timer_;
std::unordered_map<ClientID, int64_t, UniqueIDHasher> heartbeats_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document the interpretation of the int64_t?

it->second--;
if (it->second == 0) {
if (dead_clients_.count(it->first) == 0) {
RAY_LOG(WARNING) << "Client timed out: " << it->first.hex();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just it->first (the hex() happens automatically)

if use_raylet:
start_raylet_monitor(redis_address,
stdout_file=monitor_stdout_file,
stderr_file=monitor_stderr_file,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want these to go to the same log files as monitor.py or to their own log file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to go to the same file as monitor.py, since only one of the monitors should ever be active at a given time (except for the autoscaler part of monitor.py).

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4658/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4660/
Test PASSed.

@robertnishihara
Copy link
Collaborator

This test failure looks potentially related.

$ python test/monitor_test.py

testCleanupOnDriverExitManyRedisShards (__main__.MonitorTest) ... ok

testCleanupOnDriverExitSingleRedisShard (__main__.MonitorTest) ... Process Process-3:

Traceback (most recent call last):

  File "/Users/travis/miniconda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap

    self.run()

  File "/Users/travis/miniconda/lib/python2.7/multiprocessing/process.py", line 114, in run

    self._target(*self._args, **self._kwargs)

  File "test/monitor_test.py", line 47, in Driver

    if (2, 1, summary_start[2]) != StateSummary():

  File "test/monitor_test.py", line 31, in StateSummary

    obj_tbl_len = len(ray.global_state.object_table())

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.4.0-py2.7-macosx-10.6-x86_64.egg/ray/experimental/state.py", line 236, in object_table

    self._object_table(binary_to_object_id(object_id_binary)))

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.4.0-py2.7-macosx-10.6-x86_64.egg/ray/experimental/state.py", line 200, in _object_table

    result_table_response, 0)

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.4.0-py2.7-macosx-10.6-x86_64.egg/ray/core/generated/ResultTableReply.py", line 12, in GetRootAsResultTableReply

    n = flatbuffers.encode.Get(flatbuffers.packer.uoffset, buf, offset)

  File "/Users/travis/.local/lib/python2.7/site-packages/flatbuffers-2015.12.22.1-py2.7.egg/flatbuffers/encode.py", line 24, in Get

    return packer_type.unpack_from(memoryview_type(buf), head)[0]

TypeError: cannot make memory view because object does not have the buffer interface

FAIL



======================================================================

FAIL: testCleanupOnDriverExitSingleRedisShard (__main__.MonitorTest)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "test/monitor_test.py", line 91, in testCleanupOnDriverExitSingleRedisShard

    self._testCleanupOnDriverExit(num_redis_shards=1)

  File "test/monitor_test.py", line 82, in _testCleanupOnDriverExit

    self.assertEqual((0, 1), StateSummary()[:2])

AssertionError: Tuples differ: (0, 1) != (2, 2)



First differing element 0:

0

2



- (0, 1)

+ (2, 2)



----------------------------------------------------------------------

Ran 2 tests in 30.064s



FAILED (failures=1)

@stephanie-wang
Copy link
Contributor Author

Hmm, I'm sort of confused by this error because I don't think adding the new C++ monitor should mess with the original one (especially when the use_raylet flag isn't turned on). That error looks like it may be because there's an object in the object table, but no corresponding result table entry?

@robertnishihara
Copy link
Collaborator

Ah, looks like the error was already reported in #1332.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4676/
Test PASSed.

@robertnishihara robertnishihara merged commit cbf3181 into ray-project:master Apr 6, 2018
@robertnishihara robertnishihara deleted the xray-monitor branch April 6, 2018 03:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants