[rllib] [docs] Cleanup RLlib API and make docs consistent with upcoming blog post #1708

Merged: 23 commits (Mar 15, 2018); diff shown from 20 commits
2 changes: 1 addition & 1 deletion README.rst
@@ -38,7 +38,7 @@ Example Use
Ray comes with libraries that accelerate deep learning and reinforcement learning development:

- `Ray Tune`_: Hyperparameter Optimization Framework
- `Ray RLlib`_: A Scalable Reinforcement Learning Library
- `Ray RLlib`_: Scalable Reinforcement Learning

.. _`Ray Tune`: http://ray.readthedocs.io/en/latest/tune.html
.. _`Ray RLlib`: http://ray.readthedocs.io/en/latest/rllib.html
3 changes: 3 additions & 0 deletions doc/source/conf.py
@@ -320,4 +320,7 @@
# pcmoritz: To make the following work, you have to run
# sudo pip install recommonmark

# Python methods should be presented in source code order
autodoc_member_order = 'bysource'

# see also http://searchvoidstar.tumblr.com/post/125486358368/making-pdfs-from-markdown-on-readthedocsorg-using
3 changes: 2 additions & 1 deletion doc/source/index.rst
@@ -41,7 +41,7 @@ View the `codebase on GitHub`_.
Ray comes with libraries that accelerate deep learning and reinforcement learning development:

- `Ray Tune`_: Hyperparameter Optimization Framework
- `Ray RLlib`_: A Scalable Reinforcement Learning Library
- `Ray RLlib`_: Scalable Reinforcement Learning

.. _`Ray Tune`: tune.html
.. _`Ray RLlib`: rllib.html
@@ -78,6 +78,7 @@ Ray comes with libraries that accelerate deep learning and reinforcement learning development:
:caption: Ray RLlib

rllib.rst
rllib-optimizers.rst
rllib-dev.rst

.. toctree::
4 changes: 2 additions & 2 deletions doc/source/rllib-dev.rst
@@ -42,10 +42,10 @@ a common base class:
Policy Evaluators and Optimizers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: ray.rllib.optimizers.evaluator.Evaluator
.. autoclass:: ray.rllib.optimizers.policy_evaluator.PolicyEvaluator
:members:

.. autoclass:: ray.rllib.optimizers.optimizer.Optimizer
.. autoclass:: ray.rllib.optimizers.policy_optimizer.PolicyOptimizer
:members:

Sample Batches
51 changes: 51 additions & 0 deletions doc/source/rllib-optimizers.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
Using Policy Optimizers outside RLlib
=====================================

Contributor: consider just renaming to Policy Optimizers

Author: Done

RLlib supports using its distributed policy optimizer implementations from external algorithms.

Here are the steps for using an RLlib policy optimizer with an existing algorithm (a minimal end-to-end sketch follows the list):

1. Implement the `Policy evaluator interface <rllib-dev.html#policy-evaluators-and-optimizers>`__.

- Here is an example of porting a `PyTorch Rainbow implementation <https://github.com/ericl/Rainbow/blob/rllib-example/rainbow_evaluator.py>`__.
Contributor: explicit code examples here in this page would be good too


- Another example porting a `TensorFlow DQN implementation <https://github.com/ericl/baselines/blob/rllib-example/baselines/deepq/dqn_evaluator.py>`__.

2. Pick a `Policy optimizer class <https://github.com/ray-project/ray/tree/master/python/ray/rllib/optimizers>`__. The `LocalSyncOptimizer <https://github.com/ray-project/ray/blob/master/python/ray/rllib/optimizers/local_sync.py>`__ is a reasonable choice for local testing. You can also implement your own. Policy optimizers can be constructed using their ``make`` method (e.g., ``LocalSyncOptimizer.make(evaluator_cls, evaluator_args, num_workers, conf)``), or you can construct them by passing in a list of evaluators instantiated as Ray actors.
Contributor: One thing that would provide clarity is conf -> optimizer_config.

Author: Done


- Here is code showing the `simple Policy Gradient agent <https://github.com/ray-project/ray/blob/master/python/ray/rllib/pg/pg.py>`__ using ``make()``.

- A different example showing an `A3C agent <https://github.com/ray-project/ray/blob/master/python/ray/rllib/a3c/a3c.py>`__ passing in Ray actors directly.

3. Decide how you want to drive the training loop.

- Option 1: call ``optimizer.step()`` from some existing training code. Training statistics can be retrieved by querying the ``optimizer.local_evaluator`` evaluator instance, or mapping over the remote evaluators (e.g., ``ray.get([ev.some_fn.remote() for ev in optimizer.remote_evaluators])``) if you are running with multiple workers.

- Option 2: define a full RLlib `Agent class <https://github.com/ray-project/ray/blob/master/python/ray/rllib/agent.py>`__. This might be preferable if you don't have an existing training harness or want to use features provided by `Ray Tune <tune.html>`__.
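
Putting the three steps together, here is a minimal end-to-end sketch. This is an illustration only: the toy linear policy, the ``env`` config key, and the exact import location of ``LocalSyncOptimizer`` are assumptions made for the example, and a real evaluator should return RLlib sample batches (see the `developer docs <rllib-dev.html>`__) rather than the plain Python lists used here.

.. code-block:: python

    import gym
    import numpy as np

    import ray
    from ray.rllib.optimizers import PolicyEvaluator
    # The import path of LocalSyncOptimizer is assumed; see
    # optimizers/local_sync.py in the RLlib source tree.
    from ray.rllib.optimizers import LocalSyncOptimizer


    class ToyEvaluator(PolicyEvaluator):
        """Step 1: implement the policy evaluator interface.

        A fixed linear "policy" on CartPole, used only to show the shape of
        the interface; this is not a real RL algorithm.
        """

        def __init__(self, config):
            self.env = gym.make(config.get("env", "CartPole-v0"))
            self.weights = np.zeros((4, 2))

        def sample(self):
            # Collect one episode of (obs, action, reward) tuples.
            obs, done, batch = self.env.reset(), False, []
            while not done:
                action = int(np.argmax(obs.dot(self.weights)))
                next_obs, reward, done, _ = self.env.step(action)
                batch.append((obs, action, reward))
                obs = next_obs
            return batch

        def compute_gradients(self, samples):
            # Return (gradient, info dict), as in the A3C and BC evaluators.
            fake_grad = np.random.randn(*self.weights.shape) * 0.01
            return fake_grad, {"num_samples": len(samples)}

        def apply_gradients(self, grads):
            self.weights += grads

        def get_weights(self):
            return self.weights

        def set_weights(self, weights):
            self.weights = weights


    if __name__ == "__main__":
        ray.init()

        # Step 2: construct the optimizer. Argument order follows the make()
        # signature quoted above: (evaluator_cls, evaluator_args, num_workers, conf).
        optimizer = LocalSyncOptimizer.make(
            ToyEvaluator, [{"env": "CartPole-v0"}], 2, {})

        # Step 3, option 1: drive the training loop from your own code.
        for i in range(5):
            optimizer.step()
            print("iter", i, "weights", optimizer.local_evaluator.get_weights())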


Policy Optimizers
-----------------

Contributor: I just built the docs locally, and this table is quite hard to read, especially with the need to horizontally scroll. Maybe just use sections, then add hyperlinks to relevant examples that actually use each optimizer.

+-----------------------------+---------------------+-----------------+------------------------------+
| **Policy optimizer class**  | **Operating range** | **Works with**  | **Description**              |
+=============================+=====================+=================+==============================+
| AsyncOptimizer              | 1-10s of CPUs       | (any)           | Asynchronous gradient-based  |
|                             |                     |                 | optimization (e.g., A3C)     |
+-----------------------------+---------------------+-----------------+------------------------------+
| LocalSyncOptimizer          | 0-1 GPUs +          | (any)           | Synchronous gradient-based   |
|                             | 1-100s of CPUs      |                 | optimization with parallel   |
|                             |                     |                 | sample collection            |
+-----------------------------+---------------------+-----------------+------------------------------+
| LocalSyncReplayOptimizer    | 0-1 GPUs +          | Off-policy      | Adds a replay buffer         |
|                             | 1-100s of CPUs      | algorithms      | to LocalSyncOptimizer        |
+-----------------------------+---------------------+-----------------+------------------------------+
| LocalMultiGPUOptimizer      | 0-10 GPUs +         | Algorithms      | Implements data-parallel     |
|                             | 1-100s of CPUs      | written in      | optimization over multiple   |
|                             |                     | TensorFlow      | GPUs, e.g., for PPO          |
+-----------------------------+---------------------+-----------------+------------------------------+
| ApexOptimizer               | 1 GPU +             | Off-policy      | Implements the Ape-X         |
|                             | 10-100s of CPUs     | algorithms      | distributed prioritization   |
|                             |                     | w/sample        | algorithm                    |
|                             |                     | prioritization  |                              |
+-----------------------------+---------------------+-----------------+------------------------------+
35 changes: 19 additions & 16 deletions doc/source/rllib.rst
@@ -1,21 +1,16 @@
Ray RLlib: A Scalable Reinforcement Learning Library
====================================================
Ray RLlib: Scalable Reinforcement Learning
==========================================

Ray RLlib is a reinforcement learning library that aims to provide both performance and composability:
Ray RLlib is an RL execution toolkit built on the Ray distributed execution framework. RLlib implements a collection of distributed *policy optimizers* that make it easy to use a variety of training strategies with existing RL algorithms written in frameworks such as PyTorch, TensorFlow, and Theano. This enables complex architectures for RL training (e.g., Ape-X, IMPALA) to be implemented once and reused many times across different RL algorithms and libraries.

- Performance
- High performance algorithm implementations
- Pluggable distributed RL execution strategies
You can find the code for RLlib `here on GitHub <https://github.com/ray-project/ray/tree/master/python/ray/rllib>`__, and the paper `here <https://arxiv.org/abs/1712.09381>`__.

- Composability
- Integration with the `Ray Tune <tune.html>`__ hyperparam tuning tool
- Support for multiple frameworks (TensorFlow, PyTorch)
- Scalable primitives for developing new algorithms
- Shared models between algorithms
.. note::

You can find the code for RLlib `here on GitHub <https://github.com/ray-project/ray/tree/master/python/ray/rllib>`__, and the NIPS symposium paper `here <https://arxiv.org/abs/1712.09381>`__.
To use RLlib's policy optimizers outside of RLlib, see the `RLlib policy optimizers documentation <rllib-optimizers.html>`__.

RLlib currently provides the following algorithms:

RLlib's policy optimizers serve as the basis for RLlib's reference algorithms, which include:

- `Proximal Policy Optimization (PPO) <https://arxiv.org/abs/1707.06347>`__ which
is a proximal variant of `TRPO <https://arxiv.org/abs/1502.05477>`__.
@@ -24,6 +19,8 @@ RLlib currently provides the following algorithms:

- `Deep Q Networks (DQN) <https://arxiv.org/abs/1312.5602>`__.

- `Ape-X Distributed Prioritized Experience Replay <https://arxiv.org/abs/1803.00933>`__.

- Evolution Strategies, as described in `this
paper <https://arxiv.org/abs/1703.03864>`__. Our implementation
is adapted from
@@ -80,7 +77,7 @@ The ``train.py`` script has a number of options you can show by running
The most important options are for choosing the environment
with ``--env`` (any OpenAI gym environment including ones registered by the user
can be used) and for choosing the algorithm with ``--run``
(available options are ``PPO``, ``A3C``, ``ES`` and ``DQN``).
(available options are ``PPO``, ``A3C``, ``ES``, ``DQN`` and ``APEX``).

Specifying Parameters
~~~~~~~~~~~~~~~~~~~~~
@@ -89,8 +86,9 @@ Each algorithm has specific hyperparameters that can be set with ``--config`` -
``DEFAULT_CONFIG`` variable in
`PPO <https://github.com/ray-project/ray/blob/master/python/ray/rllib/ppo/ppo.py>`__,
`A3C <https://github.com/ray-project/ray/blob/master/python/ray/rllib/a3c/a3c.py>`__,
`ES <https://github.com/ray-project/ray/blob/master/python/ray/rllib/es/es.py>`__ and
`DQN <https://github.com/ray-project/ray/blob/master/python/ray/rllib/dqn/dqn.py>`__.
`ES <https://github.com/ray-project/ray/blob/master/python/ray/rllib/es/es.py>`__,
`DQN <https://github.com/ray-project/ray/blob/master/python/ray/rllib/dqn/dqn.py>`__ and
`APEX <https://github.com/ray-project/ray/blob/master/python/ray/rllib/dqn/apex.py>`__.

In an example below, we train A3C by specifying 8 workers through the config flag.
function that creates the env to refer to it by name. The contents of the env_config agent config field will be passed to that function to allow the environment to be configured. The return type should be an OpenAI gym.Env. For example:
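
The concrete example that followed here is collapsed in this diff view. As an illustrative sketch only (the ``register_env`` helper from ``ray.tune.registry`` and all names below are assumptions, not taken from this diff), an env creator could look like:

.. code-block:: python

    import gym

    from ray.tune.registry import register_env  # registration helper (assumed)


    def env_creator(env_config):
        # env_config receives the contents of the agent's "env_config" field.
        env = gym.make("CartPole-v0")
        # Configure the env from env_config here as needed, for example:
        # env.unwrapped.length = env_config.get("pole_length", 0.5)
        return env  # must be an OpenAI gym.Env


    register_env("my_env", env_creator)
    # "my_env" can then be used as the environment name when training.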
@@ -325,6 +323,11 @@ in the ``config`` section of the experiments.
For an advanced example of using Population Based Training (PBT) with RLlib,
see the `PPO + PBT Walker2D training example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_ppo_example.py>`__.

Using Policy Optimizers outside of RLlib
----------------------------------------

See the `RLlib policy optimizers documentation <rllib-optimizers.html>`__.

Contributing to RLlib
---------------------

27 changes: 17 additions & 10 deletions python/ray/rllib/README.rst
@@ -1,9 +1,11 @@
Ray RLlib: A Scalable Reinforcement Learning Library
====================================================
Ray RLlib: Scalable Reinforcement Learning
==========================================

This README provides a brief technical overview of RLlib. See also the `user documentation <http://ray.readthedocs.io/en/latest/rllib.html>`__ and `NIPS symposium paper <https://arxiv.org/abs/1712.09381>`__.
This README provides a brief technical overview of RLlib. See also the `user documentation <http://ray.readthedocs.io/en/latest/rllib.html>`__ and `paper <https://arxiv.org/abs/1712.09381>`__.

RLlib currently provides the following algorithms:
Ray RLlib is an RL execution toolkit built on the Ray distributed execution framework. RLlib implements a collection of distributed *policy optimizers* that make it easy to use a variety of training strategies with existing RL algorithms written in frameworks such as PyTorch, TensorFlow, and Theano. This enables complex architectures for RL training (e.g., Ape-X, IMPALA) to be implemented *once* and reused many times across different RL algorithms and libraries.

RLlib's policy optimizers serve as the basis for RLlib's reference algorithms, which include:

- `Proximal Policy Optimization (PPO) <https://arxiv.org/abs/1707.06347>`__ which
is a proximal variant of `TRPO <https://arxiv.org/abs/1502.05477>`__.
@@ -12,13 +14,17 @@ RLlib currently provides the following algorithms:

- `Deep Q Networks (DQN) <https://arxiv.org/abs/1312.5602>`__.

- `Ape-X Distributed Prioritized Experience Replay <https://arxiv.org/abs/1803.00933>`__.

- Evolution Strategies, as described in `this
paper <https://arxiv.org/abs/1703.03864>`__. Our implementation
is adapted from
`here <https://github.com/openai/evolution-strategies-starter>`__.

These algorithms can be run on any OpenAI Gym MDP, including custom ones written and registered by the user.

RLlib's distributed policy optimizers can also be used by any existing algorithm or RL library that implements the policy evaluator interface (optimizers/policy_evaluator.py).


Training API
------------
@@ -33,19 +39,20 @@ All RLlib algorithms implement a common training API (agent.py), which enables m
# Integration with ray.tune for hyperparam evaluation
python train.py -f tuned_examples/cartpole-grid-search-example.yaml

Policy Evaluator and Optimizer abstractions
-------------------------------------------
Policy Optimizer abstraction
----------------------------

RLlib's gradient-based algorithms are composed using two abstractions: Evaluators (evaluator.py) and Optimizers (optimizers/optimizer.py). Optimizers encapsulate a particular distributed optimization strategy for RL. Evaluators encapsulate the model graph, and once implemented, any Optimizer may be "plugged in" to any algorithm that implements the Evaluator interface.
RLlib's gradient-based algorithms are composed using two abstractions: policy evaluators (optimizers/policy_evaluator.py) and policy optimizers (optimizers/policy_optimizer.py). Policy optimizers serve as the "control plane" of algorithms and implement a particular distributed optimization strategy for RL. Evaluators implement the algorithm "data plane" and encapsulate the model graph. Once an evaluator for an algorithm is implemented, it is compatible with any policy optimizer.

This pluggability enables optimization strategies to be re-used and improved across different algorithms and deep learning frameworks (RLlib's optimizers work with both TensorFlow and PyTorch, though currently only A3C has a PyTorch graph implementation).
This pluggability enables complex architectures for distributed training to be defined *once* and reused many times across different algorithms and RL libraries.

These are the currently available optimizers:

- ``AsyncOptimizer`` is an asynchronous RL optimizer, i.e. like A3C. It asynchronously pulls and applies gradients from evaluators, sending updated weights back as needed.
- ``LocalSyncOptimizer`` is a simple synchronous RL optimizer. It pulls samples from remote evaluators, concatenates them, and then updates a local model. The updated model weights are then broadcast to all remote evaluators.
- ``LocalMultiGPUOptimizer`` (currently available for PPO) This optimizer performs SGD over a number of local GPUs, and pins experience data in GPU memory to amortize the copy overhead for multiple SGD passes.
- ``AllReduceOptimizer`` (planned) This optimizer would use the Allreduce primitive to scalably synchronize weights among a number of remote GPU workers.
- ``LocalSyncReplayOptimizer`` adds experience replay to LocalSyncOptimizer (e.g., for DQNs).
- ``LocalMultiGPUOptimizer`` This optimizer performs SGD over a number of local GPUs, and pins experience data in GPU memory to amortize the copy overhead for multiple SGD passes.
- ``ApexOptimizer`` This implements the distributed experience replay algorithm for DQN and DDPG and is designed to run in a cluster setting.

Common utilities
----------------
7 changes: 2 additions & 5 deletions python/ray/rllib/__init__.py
@@ -10,11 +10,8 @@
def _register_all():
for key in ["PPO", "ES", "DQN", "APEX", "A3C", "BC", "PG", "__fake",
"__sigmoid_fake_data", "__parameter_tuning"]:
try:
from ray.rllib.agent import get_agent_class
register_trainable(key, get_agent_class(key))
except ImportError as e:
print("Warning: could not import {}: {}".format(key, e))
from ray.rllib.agent import get_agent_class
register_trainable(key, get_agent_class(key))


_register_all()
6 changes: 3 additions & 3 deletions python/ray/rllib/a3c/a3c_evaluator.py
@@ -6,14 +6,14 @@

import ray
from ray.rllib.models import ModelCatalog
from ray.rllib.optimizers import Evaluator
from ray.rllib.optimizers import PolicyEvaluator
from ray.rllib.a3c.common import get_policy_cls
from ray.rllib.utils.filter import get_filter
from ray.rllib.utils.sampler import AsyncSampler
from ray.rllib.utils.process_rollout import process_rollout


class A3CEvaluator(Evaluator):
class A3CEvaluator(PolicyEvaluator):
"""Actor object to start running simulation on workers.

The gradient computation is also executed from this object.
@@ -65,7 +65,7 @@ def get_completed_rollout_metrics(self):

def compute_gradients(self, samples):
gradient, info = self.policy.compute_gradients(samples)
return gradient
return gradient, {}

def apply_gradients(self, grads):
self.policy.apply_gradients(grads)
6 changes: 3 additions & 3 deletions python/ray/rllib/bc/bc_evaluator.py
@@ -9,10 +9,10 @@
from ray.rllib.bc.experience_dataset import ExperienceDataset
from ray.rllib.bc.policy import BCPolicy
from ray.rllib.models import ModelCatalog
from ray.rllib.optimizers import Evaluator
from ray.rllib.optimizers import PolicyEvaluator


class BCEvaluator(Evaluator):
class BCEvaluator(PolicyEvaluator):
def __init__(self, registry, env_creator, config, logdir):
env = ModelCatalog.get_preprocessor_as_wrapper(registry, env_creator(
config["env_config"]), config["model"])
@@ -31,7 +31,7 @@ def compute_gradients(self, samples):
gradient, info = self.policy.compute_gradients(samples)
self.metrics_queue.put(
{"num_samples": info["num_samples"], "loss": info["loss"]})
return gradient
return gradient, {}

def apply_gradients(self, grads):
self.policy.apply_gradients(grads)
2 changes: 1 addition & 1 deletion python/ray/rllib/dqn/common/wrappers.py
@@ -3,7 +3,7 @@
from __future__ import print_function

from ray.rllib.models import ModelCatalog
from ray.rllib.dqn.common.atari_wrappers import wrap_deepmind
from ray.rllib.utils.atari_wrappers import wrap_deepmind


def wrap_dqn(registry, env, options, random_starts):
9 changes: 1 addition & 8 deletions python/ray/rllib/dqn/dqn.py
@@ -103,7 +103,7 @@
# === Parallelism ===
# Number of workers for collecting samples with. This only makes sense
# to increase if your environment is particularly slow to sample, or if
# you're using the Ape-X optimizer.
# you're using the Async or Ape-X optimizers.
num_workers=0,
# Whether to allocate GPUs for workers (if > 0).
num_gpus_per_worker=0,
@@ -221,13 +221,6 @@ def _train_stats(self, start_timestep):

return result

def _populate_replay_buffer(self):
if self.remote_evaluators:
for e in self.remote_evaluators:
e.sample.remote(no_replay=True)
else:
self.local_evaluator.sample(no_replay=True)

def _stop(self):
# workaround for https://github.com/ray-project/ray/issues/1516
for ev in self.remote_evaluators: