[tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling #1848
Conversation
Test FAILed.
Test FAILed.
Test FAILed.
Test FAILed.
Test FAILed.
Test FAILed.
Apex is failing.
Should be fixed.
Test FAILed.
jenkins retest this please
Test PASSed.
Test FAILed.
jenkins retest this please
Test FAILed.
jenkins retest this please
Nice. A couple points:
- Each alg now has a resource requirement. As a user it's very unclear what each configuration does, and how I would go about altering the resource configuration. It would be great to have documentation for each `default_resource_request`.
- I'm not sure what `queue_trials` does.
@@ -48,14 +49,18 @@ def run_experiments(experiments, scheduler=None, with_server=False,
            using the Client API.
        server_port (int): Port number for launching TuneServer.
        verbose (bool): How much output should be printed for each trial.
        queue_trials (bool): Whether to queue trials when the cluster does
Ah I read this multiple times, and I'm not sure what this means; aren't trials "queued" already?
They aren't, though, if there aren't enough resources. For example, if you have one CPU and launch a trial that requires 4 CPUs, then it won't be queued.
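To make that concrete, here is a minimal sketch of the decision this implies; the function and argument names are illustrative, not the actual Tune internals:

```python
# Hypothetical sketch, not the actual Tune code: decide whether a trial that
# doesn't currently fit may still be kept waiting for the autoscaler.
def may_schedule(trial_cpus, trial_gpus, avail_cpus, avail_gpus, queue_trials):
    fits_now = trial_cpus <= avail_cpus and trial_gpus <= avail_gpus
    if fits_now:
        return True
    # Without queue_trials, a trial asking for more than the cluster has
    # (e.g. 4 CPUs on a 1-CPU cluster) is never started; with it, the trial
    # stays queued until the autoscaler adds nodes.
    return queue_trials
```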
@@ -858,4 +859,5 @@ def result2(t, rew):


if __name__ == "__main__":
    ray.init()
hm why is this needed?
It's because we now have to `ray.get` the trainable to inspect its resources.
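For context, a hedged illustration of the dependency being described: `ray.get` only works after `ray.init()`, and the resource inspection goes through the object store. The `resource_request` dict below is a made-up stand-in, not the actual Tune object:

```python
import ray

ray.init()  # the ray.get call below fails without an initialized Ray runtime

# Illustrative stand-in for "inspect the trainable's resources": the resource
# request is pinned in the object store and fetched back before scheduling.
resource_request = {"cpu": 1, "gpu": 0, "extra_cpu": 4, "extra_gpu": 0}
pinned = ray.put(resource_request)
print(ray.get(pinned))
```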
@@ -49,6 +50,10 @@ def __init__(self, scheduler=None, launch_web_server=False,
        server_port (int): Port number for launching TuneServer
        verbose (bool): Flag for verbosity. If False, trial results
            will not be output.
        queue_trials (bool): Whether to queue trials when the cluster does
yeah I'm not sure what this means...
@@ -68,6 +69,14 @@ class A3CAgent(Agent):
    _default_config = DEFAULT_CONFIG
    _allow_unknown_subkeys = ["model", "optimizer", "env_config"]

    @classmethod
    def default_resource_request(cls, config):
This makes sense, but in some sense it adds a layer of extra complexity. I think if anything, every agent should have some explanation of why these defaults are set.
Yeah the idea is the algorithm author deals with the complexity, not the users.
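As a rough sketch of what that looks like from the algorithm author's side (the `Resources` import path and field names are taken from this PR's diff; `MyAgent` and its `_default_config` are illustrative, not an actual RLlib agent):

```python
from ray.tune.trial import Resources  # import path assumed from this PR


class MyAgent:
    # Illustrative defaults; a real agent would use its DEFAULT_CONFIG.
    _default_config = {"num_workers": 4, "gpu": False}

    @classmethod
    def default_resource_request(cls, config):
        cf = dict(cls._default_config, **config)
        # Pin one CPU for the driver; remote workers are "extra" resources
        # that a queued trial can wait on while the cluster autoscales.
        return Resources(
            cpu=1,
            gpu=1 if cf["gpu"] else 0,
            extra_cpu=cf["num_workers"],
            extra_gpu=0)


print(MyAgent.default_resource_request({"num_workers": 8}))
```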
Test PASSed.
Test FAILed.
So queued trials are for when you start a trial but the cluster doesn't have enough extra_cpus/extra_gpus, and it hangs until the autoscaler kicks in?
That's right.
jenkins retest this please
Travis looks good. jenkins retest this please
Test PASSed.
* master:
  Handle interrupts correctly for ASIO synchronous reads and writes. (ray-project#1929)
  [DataFrame] Adding read methods and tests (ray-project#1712)
  Allow task_table_update to fail when tasks are finished. (ray-project#1927)
  [rllib] Contribute DDPG to RLlib (ray-project#1877)
  [xray] Workers blocked in a `ray.get` release their resources (ray-project#1920)
  Raylet task dispatch and throttling worker startup (ray-project#1912)
  [DataFrame] Eval fix (ray-project#1903)
  [tune] Polishing docs (ray-project#1846)
  [tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling (ray-project#1848)
  Preemptively push local arguments for actor tasks (ray-project#1901)
  [tune] Allow fetching pinned objects from trainable functions (ray-project#1895)
  Multithreading refactor for ObjectManager. (ray-project#1911)
  Add slice functionality (ray-project#1832)
  [DataFrame] Pass read_csv kwargs to _infer_column (ray-project#1894)
  Addresses missed comments from multichunk object transfer PR. (ray-project#1908)
  Allow numpy arrays to be passed by value into tasks (and inlined in the task spec). (ray-project#1816)
  [xray] Lineage cache requests notifications from the GCS about remote tasks (ray-project#1834)
  Fix UI issue for non-json-serializable task arguments. (ray-project#1892)
  Remove unnecessary calls to .hex() for object IDs. (ray-project#1910)
  Allow multiple raylets to be started on a single machine. (ray-project#1904)

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/ray/rllib/dqn/dqn.py
What do these changes do?
RLlib agents now declare their resource requests to Tune, so as a user you don't have to.
Also add a `--queue-trials` option to Tune, which will allow a Trial to be scheduled even if it somewhat exceeds the resource capacity of the cluster. This should allow RLlib to work with autoscaling clusters in many cases. Couple caveats:
Related issue number
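Based on the diff above, a hedged usage sketch of the new flag from the Tune side; the experiment name, environment, and config values are examples rather than anything taken from this PR:

```python
import ray
from ray.tune import run_experiments

ray.init()

# queue_trials=True lets a trial whose resource request exceeds the current
# cluster size wait for the autoscaler instead of failing immediately.
run_experiments(
    {
        "a3c_cartpole": {                  # example experiment name
            "run": "A3C",                  # resources come from the agent's
            "env": "CartPole-v0",          # default_resource_request
            "config": {"num_workers": 16},
        }
    },
    queue_trials=True)
```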