Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xray] Start actor workers in parallel #2168

Merged

Conversation

pcmoritz
Copy link
Contributor

What do these changes do?

Related issue number

@pcmoritz pcmoritz changed the title [WIP] Start actor workers in parallel [WIP] [xray] Start actor workers in parallel May 30, 2018
@@ -622,6 +622,10 @@ void NodeManager::AssignTask(Task &task) {
// There are no more non-actor workers available to execute this task.
// Start a new worker.
worker_pool_.StartWorker();
} else if (spec.IsActorCreationTask()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't actor creation tasks already go through the if statement, so they won't hit the else if?

} else if (spec.IsActorCreationTask()) {
// We will need a new worker for this actor, so force start one.
// This parallelizes starting lots of actor workers.
worker_pool_.StartWorker(true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this parallelize starting actor workers? It seems to be serial.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5739/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5761/
Test FAILed.

@pcmoritz pcmoritz force-pushed the start-actor-workers-parallel branch from 7c87761 to 75aa7c0 Compare May 31, 2018 06:09
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5766/
Test PASSed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented May 31, 2018

With the following program:

import ray
import time

ray.init(use_raylet=True)

@ray.remote
class Actor(object):
    def __init__(self):
        pass
    def f(self):
        pass

%time As = [Actor.remote() for i in range(64)]; ray.get([A.f.remote() for A in As])

On a m4.4xlarge (8 physical cores) we get these timings:

  • Without the PR: 27.1 s
  • With the PR: 3.4 s

cc @stephanie-wang

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5772/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5776/
Test PASSed.

@pcmoritz pcmoritz changed the title [WIP] [xray] Start actor workers in parallel [xray] Start actor workers in parallel May 31, 2018
if (!started_worker_pids_.empty() && !force_start) {
// The first condition makes sure that we are always starting up to
// std::thread::hardware_concurrency() number of processes in parallel.
if (NumStartedWorkers() > std::thread::hardware_concurrency() && !force_start) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of std::thread::hardware_concurrency(), let's use num_cpus

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5780/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5782/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5781/
Test PASSed.

@pcmoritz
Copy link
Contributor Author

retest this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5783/
Test PASSed.

Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left a few small comments.

@@ -155,7 +158,7 @@ bool WorkerPool::DisconnectWorker(std::shared_ptr<Worker> worker) {
// Protected WorkerPool methods.
void WorkerPool::AddStartedWorker(pid_t pid) { started_worker_pids_.insert(pid); }

uint32_t WorkerPool::NumStartedWorkers() const { return started_worker_pids_.size(); }
int WorkerPool::NumStartedWorkers() const { return started_worker_pids_.size(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this to an int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://google.github.io/styleguide/cppguide.html, "On Unsigned Integers"

In this case it was necessary to avoid a signed/unsigned mismatch.

In general it's better to use signed ints, unless there is a good reason not to (i.e. for doing bit manipulations or modular arithmetic).

if (!started_worker_pids_.empty() && !force_start) {
// The first condition makes sure that we are always starting up to
// num_cpus_ number of processes in parallel.
if (NumStartedWorkers() > num_cpus_ && !force_start) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the documentation for StartWorker to account for the new logic?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the documentation should also make clear that StartWorker may not actually start a worker.

@@ -116,6 +116,12 @@ bool ResourceSet::GetResource(const std::string &resource_name, double *value) c
return true;
}

double ResourceSet::GetNumCPUs() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple other places in the code where we have to do this logic (I think in node_manager.cc, maybe elsewhere too). Can you find and replace that code with the GetNumCPUs method?

/// Return the number of CPUs.
///
/// \return Number of CPUs.
double GetNumCPUs() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the convention we've been using is to lower-case acronyms, so it should be GetNumCpus.

@@ -70,7 +70,9 @@ NodeManager::NodeManager(boost::asio::io_service &io_service,
heartbeat_timer_(io_service),
heartbeat_period_ms_(config.heartbeat_period_ms),
local_resources_(config.resource_config),
worker_pool_(config.num_initial_workers, config.worker_command),
worker_pool_(config.num_initial_workers,
static_cast<int>(config.resource_config.GetNumCPUs()) + 1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why +1?

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5789/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5812/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Jun 1, 2018

Jenkins retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5817/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Jun 1, 2018

Jenkins retest this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5819/
Test PASSed.

@robertnishihara robertnishihara merged commit e1024d8 into ray-project:master Jun 2, 2018
@robertnishihara robertnishihara deleted the start-actor-workers-parallel branch June 2, 2018 06:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants