
[tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling #1848

Merged: 27 commits, Apr 16, 2018

Conversation

@ericl (Contributor) commented Apr 7, 2018

What do these changes do?

RLlib agents now declare their resource requests to Tune, so users no longer have to specify them by hand.

This also adds a --queue-trials option to Tune, which allows a trial to be scheduled even if it somewhat exceeds the current resource capacity of the cluster. This should let RLlib work with autoscaling clusters in many cases. A couple of caveats:

  • If the cluster has 0 GPUs, trials requesting GPUs won't be queued; this avoids hanging forever when the cluster would never add GPUs.
  • Some odd-shaped resource requests may queue but never trigger autoscaling, or may remain infeasible even after autoscaling (e.g. requesting more CPUs than any single machine can offer). This can be mitigated by using larger machines and setting the autoscaling threshold lower.
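The queueing decision described by these caveats can be sketched as follows. This is an illustrative model only, not the actual Tune internals; `Resources` and `can_queue` are hypothetical names introduced for the example:

```python
# Illustrative sketch of the --queue-trials decision described above.
# `Resources` and `can_queue` are hypothetical names, not Tune internals.
from dataclasses import dataclass


@dataclass
class Resources:
    cpu: int
    gpu: int


def can_queue(request: Resources, cluster: Resources, queue_trials: bool) -> bool:
    # A trial that fits the current capacity is always runnable.
    if request.cpu <= cluster.cpu and request.gpu <= cluster.gpu:
        return True
    if not queue_trials:
        return False
    # Caveat 1: with 0 GPUs in the cluster, a GPU request is never queued,
    # to avoid hanging forever if no GPU nodes will ever be added.
    if request.gpu > 0 and cluster.gpu == 0:
        return False
    # Otherwise queue the trial and wait for the autoscaler. Caveat 2 still
    # applies: the request may remain infeasible (e.g. more CPUs than any
    # single machine offers), which this simple check cannot detect.
    return True
```

Under this model, a 4-CPU trial on a 1-CPU cluster is queued only when the flag is set, while a GPU trial on a GPU-less cluster is rejected outright.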

Related issue number

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4715/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4716/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4717/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4718/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4719/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4720/

@richardliaw (Contributor)
Apex is failing

@ericl (Contributor, Author) commented Apr 8, 2018
Should be fixed

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4741/

@ericl (Contributor, Author) commented Apr 9, 2018
jenkins retest this please

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4746/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4754/

@ericl (Contributor, Author) commented Apr 9, 2018
jenkins retest this please

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4756/

@ericl (Contributor, Author) commented Apr 10, 2018
jenkins retest this please

@richardliaw (Contributor) left a review comment

Nice. A couple of points:

  1. Each algorithm now has a resource requirement. As a user it's very unclear what each configuration does and how I would go about altering it; it would be great to have documentation for each default_resource_request.

  2. I'm not sure what queue_trials does.

@@ -48,14 +49,18 @@ def run_experiments(experiments, scheduler=None, with_server=False,
using the Client API.
server_port (int): Port number for launching TuneServer.
verbose (bool): How much output should be printed for each trial.
queue_trials (bool): Whether to queue trials when the cluster does
@richardliaw (Contributor) commented on the diff:
Ah, I've read this multiple times and I'm still not sure what it means; aren't trials "queued" already?

@ericl (Contributor, Author) replied:
They aren't if there aren't enough resources. For example, if you have one CPU and launch a trial that needs 4 CPUs, it won't be queued.

@@ -858,4 +859,5 @@ def result2(t, rew):


if __name__ == "__main__":
ray.init()
@richardliaw (Contributor) commented on the diff:
hm, why is this needed?

@ericl (Contributor, Author) replied:
It's because we now have to ray.get the trainable in order to inspect its resources.

@@ -49,6 +50,10 @@ def __init__(self, scheduler=None, launch_web_server=False,
server_port (int): Port number for launching TuneServer
verbose (bool): Flag for verbosity. If False, trial results
will not be output.
queue_trials (bool): Whether to queue trials when the cluster does
@richardliaw (Contributor) commented on the diff:
yeah, I'm not sure what this means...

@@ -68,6 +69,14 @@ class A3CAgent(Agent):
_default_config = DEFAULT_CONFIG
_allow_unknown_subkeys = ["model", "optimizer", "env_config"]

@classmethod
def default_resource_request(cls, config):
@richardliaw (Contributor) commented on the diff:
This makes sense, but in some sense it adds a layer of extra complexity. If anything, every agent should have some explanation of why these defaults are set.

@ericl (Contributor, Author) replied:
Yeah, the idea is that the algorithm author deals with the complexity, not the users.
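As a sketch of the pattern being discussed, an agent can derive its resource request from the trial config, so the resource logic is written once by the algorithm author. The class name, config keys, and `Resources` shape below are illustrative assumptions, not the actual RLlib A3C implementation:

```python
# Hypothetical sketch of an agent declaring its own resource request.
# SketchAgent, its config keys, and this Resources tuple are illustrative.
from collections import namedtuple

Resources = namedtuple("Resources", ["cpu", "gpu", "extra_cpu"])


class SketchAgent:
    _default_config = {"num_workers": 2, "gpu": False}

    @classmethod
    def default_resource_request(cls, config):
        # Merge user overrides over the agent's defaults.
        cf = dict(cls._default_config, **config)
        # One CPU for the driver process; each rollout worker is an extra
        # CPU that the trial will claim as it scales out.
        return Resources(
            cpu=1,
            gpu=1 if cf["gpu"] else 0,
            extra_cpu=cf["num_workers"])
```

Tune can then call this classmethod before launching the trial, which is why the user never has to compute the numbers by hand.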

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4812/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4861/

@richardliaw (Contributor)
So queued trials are when you start a trial but the cluster doesn't have enough extra_cpus/extra_gpus, and it hangs until the autoscaler kicks in?

@ericl (Contributor, Author) commented Apr 16, 2018
That's right

@ericl (Contributor, Author) commented Apr 16, 2018
jenkins retest this please

@ericl (Contributor, Author) commented Apr 16, 2018
Travis looks good. jenkins retest this please

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4946/

@richardliaw richardliaw merged commit 7ab890f into ray-project:master Apr 16, 2018
@richardliaw richardliaw deleted the auto-conf branch April 16, 2018 23:58
royf added a commit to royf/ray that referenced this pull request Apr 22, 2018
* master:
  Handle interrupts correctly for ASIO synchronous reads and writes. (ray-project#1929)
  [DataFrame] Adding read methods and tests (ray-project#1712)
  Allow task_table_update to fail when tasks are finished. (ray-project#1927)
  [rllib] Contribute DDPG to RLlib (ray-project#1877)
  [xray] Workers blocked in a `ray.get` release their resources (ray-project#1920)
  Raylet task dispatch and throttling worker startup (ray-project#1912)
  [DataFrame] Eval fix (ray-project#1903)
  [tune] Polishing docs (ray-project#1846)
  [tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling (ray-project#1848)
  Preemptively push local arguments for actor tasks (ray-project#1901)
  [tune] Allow fetching pinned objects from trainable functions (ray-project#1895)
  Multithreading refactor for ObjectManager. (ray-project#1911)
  Add slice functionality (ray-project#1832)
  [DataFrame] Pass read_csv kwargs to _infer_column (ray-project#1894)
  Addresses missed comments from multichunk object transfer PR. (ray-project#1908)
  Allow numpy arrays to be passed by value into tasks (and inlined in the task spec). (ray-project#1816)
  [xray] Lineage cache requests notifications from the GCS about remote tasks (ray-project#1834)
  Fix UI issue for non-json-serializable task arguments. (ray-project#1892)
  Remove unnecessary calls to .hex() for object IDs. (ray-project#1910)
  Allow multiple raylets to be started on a single machine. (ray-project#1904)

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/ray/rllib/dqn/dqn.py