Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[XRay] Raylet node and object manager unification/backend redesign. #1640

Merged
merged 152 commits into from
Mar 8, 2018

Conversation

atumanov
Copy link
Contributor

@atumanov atumanov commented Mar 2, 2018

What do these changes do?

This PR is the first pass at redesigning and refactoring the Ray backend responsible for task, actor, and object management. We unify this functionality in a single new component, called Raylet, which is intended to combine the functionality of the current local scheduler and the object manager. The rewrite aims to follow object-oriented design, taking the opportunity to convert the backend to pure C++.

Significant changes:

  1. All backend event loops are to use boost::asio, with type safety.
  2. New component (Raylet) to replace local scheduler and object manager.
  3. Global scheduler is no longer assumed to be present.
  4. Moving GCS off the critical path, to fast-track common task placement patterns.
  5. All backend code follows OO design and is converted to pure C++.

Specific functionality in this PR:

  1. Maintain a pool of local workers.
  2. Schedule and assign a task based on current node resources.
  3. Schedule a task with local and remote object dependencies (no failure handling).
  4. Push an object from one object manager to another (no failure handling, single-threaded, mocked object directory).
  5. Pull an object from one object manager to another (no failure handling, single-threaded, mocked object directory).

Planned functionality not in this PR:

  1. Implement direct node manager to node manager communication for task placement.
  2. Turn on and use Lineage Cache on each node manager to locally cache task and object control information as well as forward it to a remote node manager for task execution without pushing it to GCS.

@robertnishihara
Copy link
Collaborator

cc @imzhenyu

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4073/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4082/
Test PASSed.

/// Get a client's information from the cache.
///
/// @param client The client to get information about.
const ClientInformation &GetClientInformation(ClientID client);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const ClientID& client

@@ -0,0 +1,134 @@
#ifndef RAY_OBJECTDIRECTORY_H
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAY_OBJECT_DIRECTORY_H

private:

/// Reference to the gcs client.
std::shared_ptr<GcsClient> gcs_client;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gcs_client_

ObjectDirectory::~ObjectDirectory() = default;

ObjectDirectory::ObjectDirectory(std::shared_ptr<GcsClient> gcs_client){
this->gcs_client = gcs_client;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after gcs_client -> gcs_client_, can remove the this-> everywhere.

/// \param object_id The object id that was put into the store.
/// \param client_id The client id corresponding to this node.
/// \return Status of whether this method succeeded.
virtual ray::Status ObjectAdded(const ObjectID &object_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this ReportObjectAdded?

shared_ptr<ray::GcsClient> gcs_client)
: od(new ObjectDirectory(gcs_client)),
work_(io_service_) {
this->store_client_ = unique_ptr<ObjectStoreClient>(new ObjectStoreClient(io_service, config.store_socket_name));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the this->


ActorInformation::~ActorInformation() {}

const ActorID& ActorInformation::GetActorId() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be consistent with Id and ID


namespace ray {

class NodeManager : public ClientManager<boost::asio::local::stream_protocol> {
Copy link
Contributor

@pcmoritz pcmoritz Mar 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename this to LocalScheduler? And "Raylet" is what we used to call the "NodeManager".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think this was the way we originally had it but I'm not sure now what the reasoning was for changing it. @atumanov?

Copy link
Contributor Author

@atumanov atumanov Mar 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) The notion of "local" scheduler is being outdated with this rewrite, because there's no "global" scheduler, which makes the term ambiguous and suggestive of the fact there's something other than the "local" scheduler.
(2) I think we might simply be used to the term and, naturally, have affinity for calling this component as it was named before. If we take a step back, I think this component is doing more than just task scheduling. Scheduling is only part of the NodeManager's responsibility. In fact, the scheduler is a component inside the NodeManager, periodically invoked to make scheduling decisions.

I felt like the NodeManager is more appropriate, as a more general and descriptive term capturing the fact that this component does more than just scheduling. It performs functions related to the management of the local node. Raylet is the name of the whole process that includes node management, object management, etc.

Let me know if this works for you, @pcmoritz .

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4085/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4084/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4163/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4164/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4177/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4178/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4180/
Test PASSed.

ray::Status status_code = ray::Status::OK();
if (existing_requests_.count(object_id) == 0) {
existing_requests_[object_id] = ODCallbacks({success_cb, fail_cb});
;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this semicolon do anything?

/// becomes available.
/// @param done_callback Callback to be called when subscription is installed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but "becomes available" should be unindented (other places in this file as well)


ray::Status ObjectDirectory::GetInformation(const ClientID &client_id,
const InfoSuccessCallback &success_cb,
const InfoFailureCallback &fail_cb) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer success_callback and fail_callback here (and other places)


ray::Status ObjectDirectory::GetLocationsComplete(
const ray::Status &status, const ObjectID &object_id,
const std::vector<RemoteConnectionInfo> &v) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this something other than v?

namespace ray {

struct RemoteConnectionInfo {
RemoteConnectionInfo(const ClientID &id, const std::string &ipaddr, uint16_t portnum)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using ip_address in most places n the codebase. Also portnum -> port_num

}

ray::Status ClientTable::Add(const ClientID &client_id, const std::string &ip,
uint16_t port, DoneCallback done) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done -> done_callback

Raylet::Raylet(boost::asio::io_service& io_service,
const std::string &socket_name,
const ResourceSet &resource_config,
const ObjectManagerConfig &om_config,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

om_config -> object_manager_config

bool SubtractResources(const ResourceSet &other);

/// Return the capacity value associated with the specified resource.
/// \param[in] resource_name: Resource name for which capacity is requested.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we aren't using \param[in] in very many places, what does it mean?

/// Create a task execution specification.
///
/// \param execution_dependencies The task's dependencies, determined at
/// execution time.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unindent (also other places in this file)

#include <cstddef>
#include <string>
#include <vector>
#include <unordered_map>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u comes before v :)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4183/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4184/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4201/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4202/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4203/
Test PASSed.

@pcmoritz pcmoritz self-requested a review March 8, 2018 19:54
@stephanie-wang stephanie-wang merged commit 91464a5 into master Mar 8, 2018
@robertnishihara robertnishihara deleted the xray branch March 8, 2018 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants