Skip to content

Conversation

ericl
Copy link
Contributor

@ericl ericl commented May 26, 2018

Based on: #2125

What do these changes do?

This adds a handler in monitor.py to handle raylet heartbeats. These provide the resource load stats needed to run the autoscaler.

Tested with the following config:

cluster_name: test
min_workers: 4
max_workers: 4
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a
auth:
    ssh_user: ubuntu
head_node:
    InstanceType: m4.xlarge
    ImageId: ami-6cac0313
worker_nodes:
    InstanceType: m4.xlarge
    ImageId: ami-6cac0313
file_mounts: {
    "/tmp/target": "/home/eric/Desktop/ray-private/.git/refs/heads/fix-autoscaler",
}
setup_commands:
    - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p27/bin:$PATH"' >> ~/.bashrc
    - cd ray; git reset --hard; git fetch && git checkout `cat /tmp/target`
    - diff /tmp/target /tmp/compiled || (cd ray; git clean -ffdx; cd python; python setup.py develop --user)
    - cp /tmp/target /tmp/compiled
head_setup_commands:
    - pip install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions
worker_setup_commands:
    []
head_start_ray_commands:
    - source activate tensorflow_p27 && ray stop
    - source activate tensorflow_p27 && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --use-raylet
worker_start_ray_commands:
    - source activate tensorflow_p27 && ray stop
    - source activate tensorflow_p27 && ray start --redis-address=$RAY_HEAD_IP:6379 --use-raylet

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5645/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5718/
Test PASSed.

@ericl
Copy link
Contributor Author

ericl commented May 30, 2018

Tested this again on latest and it seems to work.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5727/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5730/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5733/
Test PASSed.

return

clients = self.state.client_table()
if type(clients) is list:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note the line

if self.use_raylet:
    return

which I added recently to avoid running this method for xray. We probably want to remove that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't see that. Removed this.

for client in client_list:
if (client["ClientType"] == "local_scheduler"
and not client["Deleted"]):
if self.use_raylet:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep local_schedulers() as part of the API? It seems somewhat redundant with client_table().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it seems like a nice helper method to have. I suppose after all the old code is deleted this will all look nicer.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5842/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5883/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5898/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5903/
Test PASSed.

@ericl
Copy link
Contributor Author

ericl commented Jun 7, 2018

ping

@ericl
Copy link
Contributor Author

ericl commented Jun 7, 2018

Hopefully, yapf 0.22.0 produces the same output as travis

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5911/
Test FAILed.

if (client["ClientType"] == "local_scheduler"
and not client["Deleted"]):
if self.use_raylet:
for client in clients:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably not do the right thing in the case where machines are removed from the cluster (because those machines will then appear multiple times in clients, I think.

However, given the way that the autoscaler uses this, it probably won't matter for the autoscaler.

and not client["Deleted"]):
if self.use_raylet:
for client in clients:
if "Resources" in client:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think every entry of clients is a raylet, so this if statement is unnecessary. @stephanie-wang is that correct?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5913/
Test PASSed.

@ericl
Copy link
Contributor Author

ericl commented Jun 7, 2018

@robertnishihara removed the call to local_schedulers() now. Retested and it seems to work.

@robertnishihara robertnishihara merged commit 100d8c2 into ray-project:master Jun 7, 2018
@robertnishihara robertnishihara deleted the fix-autoscaler branch June 7, 2018 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants