Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rllib] Add the ability to run arbitrary Python scripts with ray.tune #1132

Merged
merged 39 commits into from
Oct 18, 2017

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Oct 16, 2017

This builds on top of #1107 to also support evaluation of standalone Python scripts.

See the README for a detailed example of parallelizing evaluation of a MNIST script.

I also did some refactoring to move common classes from RLlib into the ray.tune dir, however this isn't completely finished so ray.tune still has some RLlib dependencies.

cc @richardliaw @pcmoritz

@@ -5,4 +5,4 @@ hopper-ppo:
resources:
cpu: 64
gpu: 4
config: {"gamma": 0.995, "kl_coeff": 1.0, "num_sgd_iter": 20, "sgd_stepsize": 1e-4, "sgd_batchsize": 32768, "devices": ["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"], "tf_session_args": {"device_count": {"GPU": 4}, "log_device_placement": false, "allow_soft_placement": true}, "timesteps_per_batch": 160000, "num_workers": 64}
config: {"gamma": 0.995, "kl_coeff": 1.0, "num_sgd_iter": 20, "sgd_stepsize": .0001, "sgd_batchsize": 32768, "devices": ["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"], "tf_session_args": {"device_count": {"GPU": 4}, "log_device_placement": false, "allow_soft_placement": true}, "timesteps_per_batch": 160000, "num_workers": 64}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, pyyaml has a bug where it can't parse scientific notation correctly.

x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))


def train(config={'activation': 'relu'}, reporter=None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the integration with ray.tune

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2142/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2145/
Test FAILed.

@pcmoritz
Copy link
Contributor

Thanks, this is really useful (most people have standalone training scripts); will review it later today!

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2146/
Test PASSed.

Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments.

RUNNING = 'RUNNING'
TERMINATED = 'TERMINATED'
ERROR = 'ERROR'
PENDING = "PENDING"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the proper format for strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we prefer double quotes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, we've mostly been using double quotes.

@@ -98,7 +97,7 @@ def stop(self, error=False):

try:
if self.agent:
self.agent.stop.remote()
ray.get(self.agent.stop.remote())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any tradeoffs to doing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm maybe it shouldn't block, I removed this.

self.last_result.training_iteration,
round(self.last_result.episode_reward_mean, 1))

pieces = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nice - could use a dict that maps from formatting string to result and it would probably be cleaner.

@@ -32,16 +32,20 @@
# Extends the base parser defined in ray/tune/config_parser, to add some
# RLlib specific arguments. For more arguments, see the configuration
# defined there.
parser.add_argument("--redis-address", default=None, type=str,
parser.add_argument("--redis_address", default=None, type=str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html has redis-address, makes sense to standardize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

finer-grained control over trial setup and scheduling. Some examples of
calling ray.tune programmatically include:

- python/ray/tune/examples/tune_mnist_ray.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python/ray/tune/examples/tune_mnist_ray.py

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(formatting `` would be nice)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

elif hasattr(importlib, "machinery"):
# Python 3.3
from importlib.machinery import SourceFileLoader
foo = SourceFileLoader("external_file", file_path).load_module()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: foo...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if i % 10 == 0:
train_accuracy = accuracy.eval(feed_dict={
x: batch[0], y_: batch[1], keep_prob: 1.0})
if status_reporter:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to annotate this part to indicate specific modification of status reporter here.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2150/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2151/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2152/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2153/
Test PASSed.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2154/
Test PASSed.

@ericl
Copy link
Contributor Author

ericl commented Oct 18, 2017

I've seen this a few times in the failing tests cc @stephanie-wang


testManyLocalSchedulersDying (__main__.ActorReconstruction) ... Waiting for redis server at 127.0.0.1:22195 to respond...

Traceback (most recent call last):

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.2.1-py2.7-macosx-10.7-x86_64.egg/ray/workers/default_worker.py", line 94, in <module>

    ray.worker.global_worker.main_loop()

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.2.1-py2.7-macosx-10.7-x86_64.egg/ray/worker.py", line 858, in main_loop

    self._wait_for_and_process_task(task)

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.2.1-py2.7-macosx-10.7-x86_64.egg/ray/worker.py", line 808, in _wait_for_and_process_task

    self._process_task(task)

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.2.1-py2.7-macosx-10.7-x86_64.egg/ray/worker.py", line 747, in _process_task

    e, traceback_str)

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.2.1-py2.7-macosx-10.7-x86_64.egg/ray/worker.py", line 772, in _handle_process_task_failure

    self._store_outputs_in_objstore(return_object_ids, failure_objects)

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.2.1-py2.7-macosx-10.7-x86_64.egg/ray/worker.py", line 686, in _store_outputs_in_objstore

    self.put_object(objectids[i], outputs[i])

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.2.1-py2.7-macosx-10.7-x86_64.egg/ray/worker.py", line 339, in put_object

    self.store_and_register(object_id, value)

  File "/Users/travis/.local/lib/python2.7/site-packages/ray-0.2.1-py2.7-macosx-10.7-x86_64.egg/ray/worker.py", line 292, in store_and_register

    object_id.id()), self.serialization_context)

  File "plasma.pyx", line 395, in pyarrow.plasma.PlasmaClient.put

  File "plasma.pyx", line 301, in pyarrow.plasma.PlasmaClient.create

  File "error.pxi", line 79, in pyarrow.lib.check_status

ArrowIOError: Broken pipe

@robertnishihara
Copy link
Collaborator

Yeah, test failures all look innocuous to me.

@robertnishihara
Copy link
Collaborator

I'd suggest merging unless there are other changes you want to do here.

@richardliaw richardliaw merged commit 5a50e0e into ray-project:master Oct 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants