`[paper] <https://arxiv.org/abs/1706.02275>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/contrib/maddpg/maddpg.py>`__ MADDPG is a specialized multi-agent algorithm. Code here is adapted from https://github.com/openai/maddpg to integrate with RLlib multi-agent APIs. Please check `wsjeon/maddpg-rllib <https://github.com/wsjeon/maddpg-rllib>`__ for examples and more information.
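As a hedged sketch of wiring MADDPG up through Tune (the env name, spaces, and per-policy ``agent_id`` keys below are placeholders; see the linked examples for complete, known-good configs):

.. code-block:: python

    import ray
    from ray import tune
    from gym.spaces import Box, Discrete

    # Placeholder spaces; replace with those of your own multi-agent env,
    # which must be registered under "my_multiagent_env" via register_env.
    obs_space = Box(low=-1.0, high=1.0, shape=(10,))
    act_space = Discrete(5)

    ray.init()
    tune.run(
        "contrib/MADDPG",
        config={
            "env": "my_multiagent_env",
            "multiagent": {
                "policies": {
                    # (policy_cls, obs_space, act_space, per-policy config);
                    # the agent index key is an assumption here, check the
                    # linked examples for the exact MADDPG-specific keys.
                    "agent_0": (None, obs_space, act_space, {"agent_id": 0}),
                    "agent_1": (None, obs_space, act_space, {"agent_id": 1}),
                },
                # Assumes env agent ids match the policy ids above.
                "policy_mapping_fn": lambda agent_id: agent_id,
            },
        },
    )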
**MADDPG-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):
The action sampler is straightforward, it just takes the q_model, runs a forward pass, and takes the max over the Q values, returning the chosen action and its log probability:

.. code-block:: python

    # (end of the action sampler)
    # do max over Q values...
    ...
    return action, action_logp

The remainder of DQN is similar to other algorithms. Target updates are handled by an ``after_optimizer_step`` callback that periodically copies the weights of the Q network to the target.
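As a rough sketch of that periodic update (the callback signature and the ``target_q_model``, ``steps_since_update``, and ``target_update_freq`` attributes are assumptions, not the exact RLlib internals):

.. code-block:: python

    def after_optimizer_step(policy, train_batch):
        # Count optimizer steps and periodically sync the target network.
        policy.steps_since_update += 1
        if policy.steps_since_update >= policy.target_update_freq:
            # Hard update: overwrite the target Q network's weights with the
            # current Q network's weights.
            policy.target_q_model.set_weights(policy.q_model.get_weights())
            policy.steps_since_update = 0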
You can pass either a string name or a Python class to specify an environment. By default, strings will be interpreted as a gym `environment name <https://gym.openai.com/envs>`__. Custom env classes passed directly to the trainer must take a single ``env_config`` parameter in their constructor:
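For example, a minimal sketch of such a class (``MyEnv`` and its spaces are placeholders):

.. code-block:: python

    import gym
    from gym.spaces import Box, Discrete

    import ray
    import ray.rllib.agents.ppo as ppo

    class MyEnv(gym.Env):
        """Placeholder env whose constructor takes a single env_config dict."""

        def __init__(self, env_config):
            self.size = env_config.get("size", 10)
            self.observation_space = Box(low=0.0, high=1.0, shape=(self.size,))
            self.action_space = Discrete(2)

        def reset(self):
            return self.observation_space.sample()

        def step(self, action):
            obs = self.observation_space.sample()
            reward, done, info = 1.0, True, {}
            return obs, reward, done, info

    ray.init()
    # The class (not an instance) is passed directly to the trainer.
    trainer = ppo.PPOTrainer(env=MyEnv, config={"env_config": {"size": 8}})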
For a full runnable code example using the custom environment API, see `custom_env.py <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py>`__.
The gym registry is not compatible with Ray. Instead, always use the registration flows documented above to ensure Ray workers can access the environment.
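A minimal sketch of that registration flow, reusing the placeholder ``MyEnv`` class from the sketch above:

.. code-block:: python

    from ray.tune.registry import register_env

    def env_creator(env_config):
        return MyEnv(env_config)  # return an instance of your env

    # Register under a string name that trainers and Tune can refer to.
    register_env("my_env", env_creator)
    trainer = ppo.PPOTrainer(env="my_env", config={"env_config": {"size": 8}})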
In the above example, note that the ``env_creator`` function takes in an ``env_config`` object. This is a dict containing options passed in through your trainer. You can also access ``env_config.worker_index`` and ``env_config.vector_index`` to get the worker id and env id within the worker (if ``num_envs_per_worker > 0``). This can be useful if you want to train over an ensemble of different environments, for example:
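Here is a sketch of that pattern (``choose_env_for`` is a hypothetical helper mapping the two indices to a gym env id):

.. code-block:: python

    import gym
    from ray.tune.registry import register_env

    class MultiEnv(gym.Env):
        def __init__(self, env_config):
            # Pick the actual env based on the worker and vector indexes;
            # choose_env_for() is a hypothetical helper returning a gym env id.
            self.env = gym.make(
                choose_env_for(env_config.worker_index, env_config.vector_index))
            self.action_space = self.env.action_space
            self.observation_space = self.env.observation_space

        def reset(self):
            return self.env.reset()

        def step(self, action):
            return self.env.step(action)

    register_env("multienv", lambda env_config: MultiEnv(env_config))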
The following diagram provides a conceptual overview of data flow between different components in RLlib. We start with an ``Environment``, which given an action produces an observation. The observation is preprocessed by a ``Preprocessor`` and ``Filter`` (e.g. for running mean normalization) before being sent to a neural network ``Model``. The model output is in turn interpreted by an ``ActionDistribution`` to determine the next action.
Custom preprocessors should subclass the RLlib `preprocessor class <https://github.com/ray-project/ray/blob/master/rllib/models/preprocessors.py>`__ and be registered in the model catalog:

.. code-block:: python

    import ray
    import ray.rllib.agents.ppo as ppo
    from ray.rllib.models import ModelCatalog
    from ray.rllib.models.preprocessors import Preprocessor

    class MyPreprocessorClass(Preprocessor):
        ...  # preprocessor body elided in this excerpt
Custom Action Distributions
---------------------------

Similar to custom models and preprocessors, you can also specify a custom action distribution class as follows. The action dist class is passed a reference to the ``model``, which you can use to access ``model.model_config`` or other attributes of the model. This is commonly used to implement `autoregressive action outputs <#autoregressive-action-distributions>`__.

.. code-block:: python

    import ray
    import ray.rllib.agents.ppo as ppo
    from ray.rllib.models import ModelCatalog
    from ray.rllib.models.preprocessors import Preprocessor
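A hedged sketch of the overall pattern, continuing from the imports above (the base-class import path, the exact set of methods to override, and the output shape ``7`` are assumptions to verify against your RLlib version):

.. code-block:: python

    from ray.rllib.models.action_dist import ActionDistribution  # path may differ

    class MyActionDist(ActionDistribution):
        @staticmethod
        def required_model_output_shape(action_space, model_config):
            return 7  # size of the model output feature vector for this dist

        def __init__(self, inputs, model):
            super(MyActionDist, self).__init__(inputs, model)
            # The model reference gives access to model.model_config, etc.

        def sample(self): ...
        def logp(self, actions): ...
        def entropy(self): ...

    ModelCatalog.register_custom_action_dist("my_dist", MyActionDist)

    ray.init()
    trainer = ppo.PPOTrainer(env="CartPole-v0", config={
        "model": {
            "custom_action_dist": "my_dist",
        },
    })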
Custom models can be used to work with environments where (1) the set of valid actions varies per step, and/or (2) the number of valid actions is very large. In the masking approach, the model's forward pass adds a large negative mask to the logits of unavailable actions before returning them:

.. code-block:: python

    # (end of the custom model's forward pass)
    ...
    return action_logits + inf_mask, state
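As a sketch, assuming the observation provides a binary (0/1) float ``action_mask`` vector over the available actions (an assumption mirroring the cartpole example):

.. code-block:: python

    import tensorflow as tf

    def mask_logits(action_logits, action_mask):
        # Valid actions: log(1) = 0, logits unchanged. Invalid actions:
        # log(0) = -inf, clipped to tf.float32.min, which drives their
        # probabilities to (effectively) zero after the softmax.
        inf_mask = tf.maximum(tf.math.log(action_mask), tf.float32.min)
        return action_logits + inf_mask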
Depending on your use case it may make sense to use just the masking, just action embeddings, or both. For a runnable example of this in code, check out `parametric_action_cartpole.py <https://github.com/ray-project/ray/blob/master/rllib/examples/parametric_action_cartpole.py>`__. Note that since masking introduces ``tf.float32.min`` values into the model output, this technique might not work with all algorithm options. For example, algorithms might crash if they incorrectly process the ``tf.float32.min`` values. The cartpole example has working configurations for DQN (must set ``hiddens=[]``), PPO (must disable running mean and set ``vf_share_layers=True``), and several other algorithms. Not all algorithms support parametric actions; see the `feature compatibility matrix <rllib-env.html#feature-compatibility-matrix>`__.
Model-Based Rollouts
~~~~~~~~~~~~~~~~~~~~

With a custom policy, you can also perform model-based rollouts and optionally incorporate the results of those rollouts as training data. For example, suppose you wanted to extend PGPolicy for model-based rollouts. This involves overriding the ``compute_actions`` method of that policy:

.. code-block:: python

    class ModelBasedPolicy(PGPolicy):
        def compute_actions(self,
                            obs_batch,
                            state_batches,
                            prev_action_batch=None,
                            prev_reward_batch=None,
                            episodes=None):
            # compute a batch of actions based on the current obs_batch
            # and state of each episode (i.e., for multiagent). You can do
            # whatever is needed here, e.g. perform model-based rollouts.
            ...

If you want to take this rollouts data and append it to the sample batch, use the ``add_extra_batch()`` method of the `episode objects <https://github.com/ray-project/ray/blob/master/rllib/evaluation/episode.py>`__ passed in. For an example of this, see the ``testReturningModelBasedRolloutsData`` `unit test <https://github.com/ray-project/ray/blob/master/rllib/tests/test_multi_agent_env.py>`__.

Autoregressive Action Distributions
-----------------------------------

In an action space with multiple components (e.g., ``Tuple(a1, a2)``), you might want ``a2`` to be conditioned on the sampled value of ``a1``, i.e., ``a2_sampled ~ P(a2 | a1_sampled, obs)``. Normally, ``a1`` and ``a2`` would be sampled independently, reducing the expressivity of the policy.

To do this, you need both a custom model that implements the autoregressive pattern, and a custom action distribution class that leverages that model. The `autoregressive_action_dist.py <https://github.com/ray-project/ray/blob/master/rllib/examples/autoregressive_action_dist.py>`__ example shows how this can be implemented for a simple binary action space. For a more complex space, a more efficient architecture such as a `MADE <https://arxiv.org/abs/1502.03509>`__ is recommended. Note that sampling an `N`-part action requires `N` forward passes through the model; however, computing the log probability of an action can be done in one pass:
Not all algorithms support autoregressive action distributions; see the `feature compatibility matrix <rllib-env.html#feature-compatibility-matrix>`__.