[rllib] [docs] Cleanup RLlib API and make docs consistent with upcoming blog post #1708

Merged: 23 commits (Mar 15, 2018); diff shown from 20 commits
2 changes: 1 addition & 1 deletion README.rst
@@ -38,7 +38,7 @@ Example Use
Ray comes with libraries that accelerate deep learning and reinforcement learning development:

- `Ray Tune`_: Hyperparameter Optimization Framework
- `Ray RLlib`_: A Scalable Reinforcement Learning Library
- `Ray RLlib`_: Scalable Reinforcement Learning

.. _`Ray Tune`: http://ray.readthedocs.io/en/latest/tune.html
.. _`Ray RLlib`: http://ray.readthedocs.io/en/latest/rllib.html
3 changes: 3 additions & 0 deletions doc/source/conf.py
@@ -320,4 +320,7 @@
# pcmoritz: To make the following work, you have to run
# sudo pip install recommonmark

# Python methods should be presented in source code order
autodoc_member_order = 'bysource'

# see also http://searchvoidstar.tumblr.com/post/125486358368/making-pdfs-from-markdown-on-readthedocsorg-using
3 changes: 2 additions & 1 deletion doc/source/index.rst
@@ -41,7 +41,7 @@ View the `codebase on GitHub`_.
Ray comes with libraries that accelerate deep learning and reinforcement learning development:

- `Ray Tune`_: Hyperparameter Optimization Framework
- `Ray RLlib`_: A Scalable Reinforcement Learning Library
- `Ray RLlib`_: Scalable Reinforcement Learning

.. _`Ray Tune`: tune.html
.. _`Ray RLlib`: rllib.html
@@ -78,6 +78,7 @@ Ray comes with libraries that accelerate deep learning and reinforcement learning development:
:caption: Ray RLlib

rllib.rst
rllib-optimizers.rst
rllib-dev.rst

.. toctree::
4 changes: 2 additions & 2 deletions doc/source/rllib-dev.rst
@@ -42,10 +42,10 @@ a common base class:
Policy Evaluators and Optimizers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: ray.rllib.optimizers.evaluator.Evaluator
.. autoclass:: ray.rllib.optimizers.policy_evaluator.PolicyEvaluator
:members:

.. autoclass:: ray.rllib.optimizers.optimizer.Optimizer
.. autoclass:: ray.rllib.optimizers.policy_optimizer.PolicyOptimizer
:members:

Sample Batches
51 changes: 51 additions & 0 deletions doc/source/rllib-optimizers.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
Using Policy Optimizers outside RLlib
=====================================

Contributor: consider just renaming to Policy Optimizers

Author: Done

RLlib supports using its distributed policy optimizer implementations from external algorithms.

Here are the steps for using an RLlib policy optimizer with an existing algorithm (a minimal end-to-end sketch follows the list):

1. Implement the `Policy evaluator interface <rllib-dev.html#policy-evaluators-and-optimizers>`__.

- Here is an example of porting a `PyTorch Rainbow implementation <https://github.com/ericl/Rainbow/blob/rllib-example/rainbow_evaluator.py>`__.
Contributor: explicit code examples here in this page would be good too


- Another example porting a `TensorFlow DQN implementation <https://github.com/ericl/baselines/blob/rllib-example/baselines/deepq/dqn_evaluator.py>`__.

2. Pick a `Policy optimizer class <https://github.com/ray-project/ray/tree/master/python/ray/rllib/optimizers>`__. The `LocalSyncOptimizer <https://github.com/ray-project/ray/blob/master/python/ray/rllib/optimizers/local_sync.py>`__ is a reasonable choice for local testing. You can also implement your own. Policy optimizers can be constructed using their ``make`` method (e.g., ``LocalSyncOptimizer.make(evaluator_cls, evaluator_args, num_workers, conf)``), or you can construct them by passing in a list of evaluators instantiated as Ray actors.
Contributor: One thing that would provide clarity is conf -> optimizer_config.

Author: Done


- Here is code showing the `simple Policy Gradient agent <https://github.com/ray-project/ray/blob/master/python/ray/rllib/pg/pg.py>`__ using ``make()``.

- A different example showing an `A3C agent <https://github.com/ray-project/ray/blob/master/python/ray/rllib/a3c/a3c.py>`__ passing in Ray actors directly.

3. Decide how you want to drive the training loop.

- Option 1: call ``optimizer.step()`` from some existing training code. Training statistics can be retrieved by querying the ``optimizer.local_evaluator`` evaluator instance, or mapping over the remote evaluators (e.g., ``ray.get([ev.some_fn.remote() for ev in optimizer.remote_evaluators])``) if you are running with multiple workers.

- Option 2: define a full RLlib `Agent class <https://github.com/ray-project/ray/blob/master/python/ray/rllib/agent.py>`__. This might be preferable if you don't have an existing training harness or want to use features provided by `Ray Tune <tune.html>`__.
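
Putting the three steps together, here is a minimal end-to-end sketch. This is an illustration only: the toy linear policy, the ``env`` config key, and the exact import location of ``LocalSyncOptimizer`` are assumptions made for the example, and a real evaluator should return RLlib sample batches (see the `developer docs <rllib-dev.html>`__) rather than the plain Python lists used here.

.. code-block:: python

    import gym
    import numpy as np

    import ray
    from ray.rllib.optimizers import PolicyEvaluator
    # The import path of LocalSyncOptimizer is assumed; see
    # optimizers/local_sync.py in the RLlib source tree.
    from ray.rllib.optimizers import LocalSyncOptimizer


    class ToyEvaluator(PolicyEvaluator):
        """Step 1: implement the policy evaluator interface.

        A fixed linear "policy" on CartPole, used only to show the shape of
        the interface; this is not a real RL algorithm.
        """

        def __init__(self, config):
            self.env = gym.make(config.get("env", "CartPole-v0"))
            self.weights = np.zeros((4, 2))

        def sample(self):
            # Collect one episode of (obs, action, reward) tuples.
            obs, done, batch = self.env.reset(), False, []
            while not done:
                action = int(np.argmax(obs.dot(self.weights)))
                next_obs, reward, done, _ = self.env.step(action)
                batch.append((obs, action, reward))
                obs = next_obs
            return batch

        def compute_gradients(self, samples):
            # Return (gradient, info dict), as in the A3C and BC evaluators.
            fake_grad = np.random.randn(*self.weights.shape) * 0.01
            return fake_grad, {"num_samples": len(samples)}

        def apply_gradients(self, grads):
            self.weights += grads

        def get_weights(self):
            return self.weights

        def set_weights(self, weights):
            self.weights = weights


    if __name__ == "__main__":
        ray.init()

        # Step 2: construct the optimizer. Argument order follows the make()
        # signature quoted above: (evaluator_cls, evaluator_args, num_workers, conf).
        optimizer = LocalSyncOptimizer.make(
            ToyEvaluator, [{"env": "CartPole-v0"}], 2, {})

        # Step 3, option 1: drive the training loop from your own code.
        for i in range(5):
            optimizer.step()
            print("iter", i, "weights", optimizer.local_evaluator.get_weights())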


Policy Optimizers
-----------------

Contributor: I just built the docs locally, and this table is quite hard to read, especially with the need to horizontally scroll. Maybe just use sections, then add hyperlinks to relevant examples that actually use each optimizer.

+-----------------------------+---------------------+-----------------+------------------------------+
| **Policy optimizer class**  | **Operating range** | **Works with**  | **Description**              |
+=============================+=====================+=================+==============================+
| AsyncOptimizer              | 1-10s of CPUs       | (any)           | Asynchronous gradient-based  |
|                             |                     |                 | optimization (e.g., A3C)     |
+-----------------------------+---------------------+-----------------+------------------------------+
| LocalSyncOptimizer          | 0-1 GPUs +          | (any)           | Synchronous gradient-based   |
|                             | 1-100s of CPUs      |                 | optimization with parallel   |
|                             |                     |                 | sample collection            |
+-----------------------------+---------------------+-----------------+------------------------------+
| LocalSyncReplayOptimizer    | 0-1 GPUs +          | Off-policy      | Adds a replay buffer         |
|                             | 1-100s of CPUs      | algorithms      | to LocalSyncOptimizer        |
+-----------------------------+---------------------+-----------------+------------------------------+
| LocalMultiGPUOptimizer      | 0-10 GPUs +         | Algorithms      | Implements data-parallel     |
|                             | 1-100s of CPUs      | written in      | optimization over multiple   |
|                             |                     | TensorFlow      | GPUs, e.g., for PPO          |
+-----------------------------+---------------------+-----------------+------------------------------+
| ApexOptimizer               | 1 GPU +             | Off-policy      | Implements the Ape-X         |
|                             | 10-100s of CPUs     | algorithms      | distributed prioritization   |
|                             |                     | w/sample        | algorithm                    |
|                             |                     | prioritization  |                              |
+-----------------------------+---------------------+-----------------+------------------------------+
35 changes: 19 additions & 16 deletions doc/source/rllib.rst
@@ -1,21 +1,16 @@
Ray RLlib: A Scalable Reinforcement Learning Library
====================================================
Ray RLlib: Scalable Reinforcement Learning
==========================================

Ray RLlib is a reinforcement learning library that aims to provide both performance and composability:
Ray RLlib is an RL execution toolkit built on the Ray distributed execution framework. RLlib implements a collection of distributed *policy optimizers* that make it easy to use a variety of training strategies with existing RL algorithms written in frameworks such as PyTorch, TensorFlow, and Theano. This enables complex architectures for RL training (e.g., Ape-X, IMPALA) to be implemented once and reused many times across different RL algorithms and libraries.

- Performance
- High performance algorithm implementations
- Pluggable distributed RL execution strategies
You can find the code for RLlib `here on GitHub <https://github.com/ray-project/ray/tree/master/python/ray/rllib>`__, and the paper `here <https://arxiv.org/abs/1712.09381>`__.

- Composability
- Integration with the `Ray Tune <tune.html>`__ hyperparam tuning tool
- Support for multiple frameworks (TensorFlow, PyTorch)
- Scalable primitives for developing new algorithms
- Shared models between algorithms
.. note::

You can find the code for RLlib `here on GitHub <https://github.com/ray-project/ray/tree/master/python/ray/rllib>`__, and the NIPS symposium paper `here <https://arxiv.org/abs/1712.09381>`__.
To use RLlib's policy optimizers outside of RLlib, see the `RLlib policy optimizers documentation <rllib-optimizers.html>`__.

RLlib currently provides the following algorithms:

RLlib's policy optimizers serve as the basis for RLlib's reference algorithms, which include:

- `Proximal Policy Optimization (PPO) <https://arxiv.org/abs/1707.06347>`__ which
is a proximal variant of `TRPO <https://arxiv.org/abs/1502.05477>`__.
@@ -24,6 +19,8 @@ RLlib currently provides the following algorithms:

- `Deep Q Networks (DQN) <https://arxiv.org/abs/1312.5602>`__.

- `Ape-X Distributed Prioritized Experience Replay <https://arxiv.org/abs/1803.00933>`__.

- Evolution Strategies, as described in `this
paper <https://arxiv.org/abs/1703.03864>`__. Our implementation
is adapted from
@@ -80,7 +77,7 @@ The ``train.py`` script has a number of options you can show by running
The most important options are for choosing the environment
with ``--env`` (any OpenAI gym environment including ones registered by the user
can be used) and for choosing the algorithm with ``--run``
(available options are ``PPO``, ``A3C``, ``ES`` and ``DQN``).
(available options are ``PPO``, ``A3C``, ``ES``, ``DQN`` and ``APEX``).

Specifying Parameters
~~~~~~~~~~~~~~~~~~~~~
@@ -89,8 +86,9 @@ Each algorithm has specific hyperparameters that can be set with ``--config`` -
``DEFAULT_CONFIG`` variable in
`PPO <https://github.com/ray-project/ray/blob/master/python/ray/rllib/ppo/ppo.py>`__,
`A3C <https://github.com/ray-project/ray/blob/master/python/ray/rllib/a3c/a3c.py>`__,
`ES <https://github.com/ray-project/ray/blob/master/python/ray/rllib/es/es.py>`__ and
`DQN <https://github.com/ray-project/ray/blob/master/python/ray/rllib/dqn/dqn.py>`__.
`ES <https://github.com/ray-project/ray/blob/master/python/ray/rllib/es/es.py>`__,
`DQN <https://github.com/ray-project/ray/blob/master/python/ray/rllib/dqn/dqn.py>`__ and
`APEX <https://github.com/ray-project/ray/blob/master/python/ray/rllib/dqn/apex.py>`__.

In an example below, we train A3C by specifying 8 workers through the config flag.
function that creates the env to refer to it by name. The contents of the env_config agent config field will be passed to that function to allow the environment to be configured. The return type should be an OpenAI gym.Env. For example:
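
The concrete example that followed here is collapsed in this diff view. As an illustrative sketch only (the ``register_env`` helper from ``ray.tune.registry`` and all names below are assumptions, not taken from this diff), an env creator could look like:

.. code-block:: python

    import gym

    from ray.tune.registry import register_env  # registration helper (assumed)


    def env_creator(env_config):
        # env_config receives the contents of the agent's "env_config" field.
        env = gym.make("CartPole-v0")
        # Configure the env from env_config here as needed, for example:
        # env.unwrapped.length = env_config.get("pole_length", 0.5)
        return env  # must be an OpenAI gym.Env


    register_env("my_env", env_creator)
    # "my_env" can then be used as the environment name when training.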
@@ -325,6 +323,11 @@ in the ``config`` section of the experiments.
For an advanced example of using Population Based Training (PBT) with RLlib,
see the `PPO + PBT Walker2D training example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_ppo_example.py>`__.

Using Policy Optimizers outside of RLlib
----------------------------------------

See the `RLlib policy optimizers documentation <rllib-optimizers.html>`__.

Contributing to RLlib
---------------------

27 changes: 17 additions & 10 deletions python/ray/rllib/README.rst
@@ -1,9 +1,11 @@
Ray RLlib: A Scalable Reinforcement Learning Library
====================================================
Ray RLlib: Scalable Reinforcement Learning
==========================================

This README provides a brief technical overview of RLlib. See also the `user documentation <http://ray.readthedocs.io/en/latest/rllib.html>`__ and `NIPS symposium paper <https://arxiv.org/abs/1712.09381>`__.
This README provides a brief technical overview of RLlib. See also the `user documentation <http://ray.readthedocs.io/en/latest/rllib.html>`__ and `paper <https://arxiv.org/abs/1712.09381>`__.

RLlib currently provides the following algorithms:
Ray RLlib is an RL execution toolkit built on the Ray distributed execution framework. RLlib implements a collection of distributed *policy optimizers* that make it easy to use a variety of training strategies with existing RL algorithms written in frameworks such as PyTorch, TensorFlow, and Theano. This enables complex architectures for RL training (e.g., Ape-X, IMPALA) to be implemented *once* and reused many times across different RL algorithms and libraries.

RLlib's policy optimizers serve as the basis for RLlib's reference algorithms, which include:

- `Proximal Policy Optimization (PPO) <https://arxiv.org/abs/1707.06347>`__ which
is a proximal variant of `TRPO <https://arxiv.org/abs/1502.05477>`__.
@@ -12,13 +14,17 @@ RLlib currently provides the following algorithms:

- `Deep Q Networks (DQN) <https://arxiv.org/abs/1312.5602>`__.

- `Ape-X Distributed Prioritized Experience Replay <https://arxiv.org/abs/1803.00933>`__.

- Evolution Strategies, as described in `this
paper <https://arxiv.org/abs/1703.03864>`__. Our implementation
is adapted from
`here <https://github.com/openai/evolution-strategies-starter>`__.

These algorithms can be run on any OpenAI Gym MDP, including custom ones written and registered by the user.

RLlib's distributed policy optimizers can also be used by any existing algorithm or RL library that implements the policy evaluator interface (optimizers/policy_evaluator.py).


Training API
------------
@@ -33,19 +39,20 @@ All RLlib algorithms implement a common training API (agent.py), which enables m
# Integration with ray.tune for hyperparam evaluation
python train.py -f tuned_examples/cartpole-grid-search-example.yaml

Policy Evaluator and Optimizer abstractions
-------------------------------------------
Policy Optimizer abstraction
----------------------------

RLlib's gradient-based algorithms are composed using two abstractions: Evaluators (evaluator.py) and Optimizers (optimizers/optimizer.py). Optimizers encapsulate a particular distributed optimization strategy for RL. Evaluators encapsulate the model graph, and once implemented, any Optimizer may be "plugged in" to any algorithm that implements the Evaluator interface.
RLlib's gradient-based algorithms are composed using two abstractions: policy evaluators (optimizers/policy_evaluator.py) and policy optimizers (optimizers/policy_optimizer.py). Policy optimizers serve as the "control plane" of algorithms and implement a particular distributed optimization strategy for RL. Evaluators implement the algorithm "data plane" and encapsulate the model graph. Once an evaluator for an algorithm is implemented, it is compatible with any policy optimizer.

This pluggability enables optimization strategies to be re-used and improved across different algorithms and deep learning frameworks (RLlib's optimizers work with both TensorFlow and PyTorch, though currently only A3C has a PyTorch graph implementation).
This pluggability enables complex architectures for distributed training to be defined *once* and reused many times across different algorithms and RL libraries.

These are the currently available optimizers:

- ``AsyncOptimizer`` is an asynchronous RL optimizer, i.e. like A3C. It asynchronously pulls and applies gradients from evaluators, sending updated weights back as needed.
- ``LocalSyncOptimizer`` is a simple synchronous RL optimizer. It pulls samples from remote evaluators, concatenates them, and then updates a local model. The updated model weights are then broadcast to all remote evaluators.
- ``LocalMultiGPUOptimizer`` (currently available for PPO) This optimizer performs SGD over a number of local GPUs, and pins experience data in GPU memory to amortize the copy overhead for multiple SGD passes.
- ``AllReduceOptimizer`` (planned) This optimizer would use the Allreduce primitive to scalably synchronize weights among a number of remote GPU workers.
- ``LocalSyncReplayOptimizer`` adds experience replay to LocalSyncOptimizer (e.g., for DQNs).
- ``LocalMultiGPUOptimizer`` This optimizer performs SGD over a number of local GPUs, and pins experience data in GPU memory to amortize the copy overhead for multiple SGD passes.
- ``ApexOptimizer`` This implements the distributed experience replay algorithm for DQN and DDPG and is designed to run in a cluster setting.

Common utilities
----------------
7 changes: 2 additions & 5 deletions python/ray/rllib/__init__.py
@@ -10,11 +10,8 @@
def _register_all():
for key in ["PPO", "ES", "DQN", "APEX", "A3C", "BC", "PG", "__fake",
"__sigmoid_fake_data", "__parameter_tuning"]:
try:
from ray.rllib.agent import get_agent_class
register_trainable(key, get_agent_class(key))
except ImportError as e:
print("Warning: could not import {}: {}".format(key, e))
from ray.rllib.agent import get_agent_class
register_trainable(key, get_agent_class(key))


_register_all()
6 changes: 3 additions & 3 deletions python/ray/rllib/a3c/a3c_evaluator.py
@@ -6,14 +6,14 @@

import ray
from ray.rllib.models import ModelCatalog
from ray.rllib.optimizers import Evaluator
from ray.rllib.optimizers import PolicyEvaluator
from ray.rllib.a3c.common import get_policy_cls
from ray.rllib.utils.filter import get_filter
from ray.rllib.utils.sampler import AsyncSampler
from ray.rllib.utils.process_rollout import process_rollout


class A3CEvaluator(Evaluator):
class A3CEvaluator(PolicyEvaluator):
"""Actor object to start running simulation on workers.

The gradient computation is also executed from this object.
@@ -65,7 +65,7 @@ def get_completed_rollout_metrics(self):

def compute_gradients(self, samples):
gradient, info = self.policy.compute_gradients(samples)
return gradient
return gradient, {}

def apply_gradients(self, grads):
self.policy.apply_gradients(grads)
6 changes: 3 additions & 3 deletions python/ray/rllib/bc/bc_evaluator.py
@@ -9,10 +9,10 @@
from ray.rllib.bc.experience_dataset import ExperienceDataset
from ray.rllib.bc.policy import BCPolicy
from ray.rllib.models import ModelCatalog
from ray.rllib.optimizers import Evaluator
from ray.rllib.optimizers import PolicyEvaluator


class BCEvaluator(Evaluator):
class BCEvaluator(PolicyEvaluator):
def __init__(self, registry, env_creator, config, logdir):
env = ModelCatalog.get_preprocessor_as_wrapper(registry, env_creator(
config["env_config"]), config["model"])
@@ -31,7 +31,7 @@ def compute_gradients(self, samples):
gradient, info = self.policy.compute_gradients(samples)
self.metrics_queue.put(
{"num_samples": info["num_samples"], "loss": info["loss"]})
return gradient
return gradient, {}

def apply_gradients(self, grads):
self.policy.apply_gradients(grads)
2 changes: 1 addition & 1 deletion python/ray/rllib/dqn/common/wrappers.py
@@ -3,7 +3,7 @@
from __future__ import print_function

from ray.rllib.models import ModelCatalog
from ray.rllib.dqn.common.atari_wrappers import wrap_deepmind
from ray.rllib.utils.atari_wrappers import wrap_deepmind


def wrap_dqn(registry, env, options, random_starts):
9 changes: 1 addition & 8 deletions python/ray/rllib/dqn/dqn.py
@@ -103,7 +103,7 @@
# === Parallelism ===
# Number of workers for collecting samples with. This only makes sense
# to increase if your environment is particularly slow to sample, or if
# you're using the Ape-X optimizer.
# you're using the Async or Ape-X optimizers.
num_workers=0,
# Whether to allocate GPUs for workers (if > 0).
num_gpus_per_worker=0,
@@ -221,13 +221,6 @@ def _train_stats(self, start_timestep):

return result

def _populate_replay_buffer(self):
if self.remote_evaluators:
for e in self.remote_evaluators:
e.sample.remote(no_replay=True)
else:
self.local_evaluator.sample(no_replay=True)

def _stop(self):
# workaround for https://github.com/ray-project/ray/issues/1516
for ev in self.remote_evaluators: