[RLLib] DDPG #1685
Conversation
"""Returns a batch of samples.""" | ||
# act in the environment, generate new samples | ||
rollout = self.sampler.get_data() | ||
samples = process_rollout( |
make sure observations + new_obs treatment is the same
self.sess.run(tf.global_variables_initializer())

#TODO: (not critical) Add batch normalization?
def sample(self, no_replay = True):
Is it possible to remove the replay handling from DDPG, and use SyncLocalReplayOptimizer / ApexOptimizer instead? Recently we refactored DQN to have replay be handled by the optimizer classes.
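For reference, a minimal sketch of what the evaluator's sample() could look like once replay is owned by an optimizer class (reusing names already in this PR; the replay optimizer itself would do the storing and resampling):

def sample(self):
    # No replay bookkeeping here: just return fresh experience and let a
    # replay-capable optimizer store and resample it.
    rollout = self.sampler.get_data()
    return process_rollout(rollout, NoFilter(), gamma=1.0, use_gae=False)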
@@ -0,0 +1,197 @@
from __future__ import absolute_import
Note: a copy of these classes is in ray.rllib.optimizers already
@@ -240,6 +240,9 @@ def get_agent_class(alg):
     elif alg == "PG":
         from ray.rllib import pg
         return pg.PGAgent
+    elif alg == "DDPG":
+        from ray.rllib import ddpg
+        return ddpg.DDPGAgent
Nice. It would be good to also:
- file an issue to update the docs; there are a couple of new algorithms that aren't documented yet
- add a DDPG example to the regression tests folder (tuned_examples/regression_tests)
- add a DDPG sanity check to multi_node_tests.sh
@alvkao58 can you take care of these comments by Eric? (see PR)
any particular sanity checks you want to see added to multi_node_tests.sh?
def compute_gradients(self, samples):
    """ Returns gradient w.r.t. samples."""
    # actor gradients
    actor_actions = self.sess.run(self.model.output_action,
don't you track the actor samples in samples["actions"] or something?
yeah, but looking at https://arxiv.org/pdf/1509.02971.pdf, aren't the actions according to what the actor currently outputs?
ok yeah good point
some comments
python/ray/rllib/ddpg/ddpg.py
Outdated
def _train(self):
    self.optimizer.step()
    # update target
    self.local_evaluator.update_target()
I think update_target is actually supposed to happen very often
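For reference, a rough sketch of interleaving the target update with optimizer steps; the loop count, config key, and stats helper here are hypothetical, not from this PR:

def _train(self):
    for _ in range(self.config.get("train_steps_per_iteration", 100)):  # hypothetical key
        self.optimizer.step()
        # Soft-update the target networks after every learning step, as in the
        # DDPG paper, rather than once per _train() call.
        self.local_evaluator.update_target()
    return self._collect_training_result()  # hypothetical stats helper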
python/ray/rllib/ddpg/models.py
Outdated
self.obs_and_actor = tf.concat([self.obs, self.output_action], 1)  # output_action is output of actor network
with tf.variable_scope("critic", reuse=True):
    self.cn_for_loss = FullyConnectedNetwork(self.obs_and_actor,
Maybe let's try being a bit more explicit for now, just getting a specific action_gradient. See self.action_grads in http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
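For reference, a small self-contained TF 1.x sketch of that explicit action-gradient path; layer sizes and tensor names are illustrative, not the ones in this PR:

import tensorflow as tf

obs = tf.placeholder(tf.float32, [None, 3])
action_ph = tf.placeholder(tf.float32, [None, 1])

with tf.variable_scope("actor"):
    actor_out = tf.layers.dense(obs, 1, activation=tf.nn.tanh)
with tf.variable_scope("critic"):
    q_value = tf.layers.dense(tf.concat([obs, action_ph], 1), 1)

actor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="actor")

# dQ/da: gradient of the critic output w.r.t. the action input. At run time,
# action_ph is fed with the actor's own output so this is evaluated at
# a = actor(s).
action_grads = tf.gradients(q_value, action_ph)[0]

# Chain rule into the actor parameters; the minus sign makes a minimizing
# optimizer ascend Q.
actor_grads = tf.gradients(actor_out, actor_vars, grad_ys=-action_grads)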
critic_grad = self.sess.run(self.critic_grads,
                            feed_dict=critic_feed_dict)

return (critic_grad, actor_grad), {}
Let's move all the TensorFlow-specific things out of this class.
python/ray/rllib/ddpg/models.py
Outdated
        self.obs, self.output_action)

def _create_critic_network(self, obs, action):
    net = tflearn.fully_connected(obs, 400)
Let's move this back to slim, and preferably make it a Model we can import from ModelCatalog.
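For reference, a hedged sketch of a slim-based critic along those lines (layer sizes follow the DDPG paper; registering it with ModelCatalog is left out):

import tensorflow as tf
import tensorflow.contrib.slim as slim

def create_critic_network(obs, action):
    # Illustrative only; the eventual version would live under the models
    # directory and be importable via ModelCatalog.
    w_normal = tf.truncated_normal_initializer()
    net = slim.fully_connected(
        obs, 400, activation_fn=tf.nn.relu, weights_initializer=w_normal)
    net = slim.fully_connected(
        tf.concat([net, action], 1), 300,
        activation_fn=tf.nn.relu, weights_initializer=w_normal)
    return slim.fully_connected(
        net, 1, activation_fn=None, weights_initializer=w_normal)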
@@ -20,7 +20,7 @@ class PartialRollout(object):
        last_r (float): Value of next state. Used for bootstrapping.
    """

-    fields = ["observations", "actions", "rewards", "terminal", "features"]
+    fields = ["obs", "actions", "rewards", "new_obs", "dones", "features"]
remind me again why we are changing terminal to dones?
I did it for convenience because LocalSyncReplayOptimizer uses "obs", "new_obs", and "dones". I can change it back if necessary.
I made this consistent across all algorithms that use this sampler.
with tf.variable_scope("model"):
    self.model = DDPGModel(self.registry,
                           self.env,
this ends up being a little verbose; consider

self.model = DDPGModel(
    self.registry, self.env, ...)
I think that if I don't do this, the model and target_model end up sharing weights?
python/ray/rllib/ddpg/ddpg.py
Outdated
for s in stats:
    mean_10ep_reward += s["mean_10ep_reward"] / len(stats)
    mean_10ep_length += s["mean_10ep_length"] / len(stats)
    num_episodes += s["num_episodes"]
would you want to keep this as a running average?
Is episode_reward_mean in TrainingResult meant to give an average from the start of time?
It will give an average over the number of episodes finished since the last TrainingResult.
I think here we should just do a lot of optimizer steps and return a TrainingResult only after a while.
@@ -0,0 +1,210 @@
# imports
I think we should move all of the TensorFlow stuff out of this class; it would make it a lot easier for us to support PyTorch.
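For reference, a rough sketch of that split, with all names hypothetical: the evaluator stays framework-agnostic and delegates graph/session work to a model wrapper, so a PyTorch wrapper could later be dropped in behind the same interface.

class DDPGEvaluator(object):
    def __init__(self, registry, env_creator, config):
        # make_ddpg_model is a hypothetical factory returning either a
        # TF-backed or a Torch-backed wrapper with the same interface.
        self.model = make_ddpg_model(registry, env_creator, config)

    def compute_gradients(self, samples):
        return self.model.compute_gradients(samples)

    def apply_gradients(self, grads):
        self.model.apply_gradients(grads)

    def update_target(self):
        self.model.update_target()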
python/ray/rllib/ddpg/models.py
Outdated
    weights_initializer=w_normal)
net = tf.nn.relu(tf.add(t1, t2))

out = slim.fully_connected(net,
** a note throughout
- can we not do this style of indentation? and change it to:

net = slim.fully_connected(
    obs, 400, activation_fn=tf.nn.relu, weights_initializer=w_normal)
python/ray/rllib/ddpg/models.py
Outdated
def _create_critic_network(self, obs, action):
    """Network for critic."""
    w_normal = tf.truncated_normal_initializer()
can we move this into the models directory?
python/ray/rllib/ddpg/models.py
Outdated
def _setup_target_updates(self):
    """Set up target actor and critic updates."""
    a_updates = []
    for var, target_var in zip(self.model.actor_var_list,
formatting for this is also a little odd; would be great to fix this.
You can reduce verbosity by just setting tau = self.config["tau"]
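For reference, one way the less verbose version could read (the target model's attribute name is assumed):

def _setup_target_updates(self):
    """Set up target actor and critic updates."""
    tau = self.config["tau"]
    a_updates = [
        target_var.assign(tau * var + (1.0 - tau) * target_var)
        for var, target_var in zip(self.model.actor_var_list,
                                   self.target_model.actor_var_list)]  # assumed attribute
    self.target_actor_update = tf.group(*a_updates)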
python/ray/rllib/ddpg/models.py
Outdated
self.sess = tf.Session()

with tf.variable_scope("model"):
    self.model = DDPGActorCritic(registry,
see ** a note throughout
The graph is without batchnorm?
OK this is looking really good! We have one last thing, which is to add this to our test suite in multi_node_tests.sh and add an example to the regression tests folder.
Otherwise, this is ready to merge (and will probably be done after #1868)
samples = process_rollout(
    rollout, NoFilter(),
    gamma=1.0, use_gae=False)
nit: we should add a comment as to why this gamma is 1.0 (or why it should not be the gamma from the config)
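For reference, one possible wording for such a comment; the reasoning about where discounting happens is an assumption, not taken from this PR:

samples = process_rollout(
    rollout, NoFilter(),
    # gamma is fixed at 1.0 because nothing is discounted at rollout time;
    # discounting is presumably applied later in the critic's TD target using
    # the gamma from the config.
    gamma=1.0, use_gae=False)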
python/ray/rllib/ddpg/ddpg.py
Outdated
episode_lengths.append(episode.episode_length)
episode_rewards.append(episode.episode_reward)
avg_reward = (
    np.mean(episode_rewards) if episode_rewards else float('nan'))
nit: I believe you can just do np.mean(episode_rewards) and np.sum (below) to the same effect, without the if-else cases
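A quick check of that nit using standard NumPy behavior: np.mean of an empty sequence already yields nan (with a RuntimeWarning) and np.sum yields 0, so the explicit fallbacks are redundant.

import numpy as np

print(np.mean([]))  # nan (emits a RuntimeWarning)
print(np.sum([]))   # 0.0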
python/ray/rllib/ddpg/ddpg.py
Outdated
# Number of local steps taken for each call to sample
"num_local_steps": 1,
# Number of workers (excluding master)
"num_workers": 1,
what happens when num_workers is 0? does the algorithm still output anything?
As of now, it doesn't output anything when num_workers is 0, but I could update the stats so it takes metrics from the local evaluator when there are no remote evaluators.
OK let's do that
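For reference, a minimal sketch of that fallback; the stats method names and remote-call pattern are assumptions modeled on the rest of RLlib at the time:

if self.remote_evaluators:
    stats = ray.get([e.stats.remote() for e in self.remote_evaluators])
else:
    # With num_workers == 0, report metrics from the local evaluator instead.
    stats = [self.local_evaluator.stats()]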
def get_weights(self):
    """Returns critic weights, actor weights."""
    return self.critic_vars.get_weights(), self.actor_vars.get_weights()
nit: it would be great if we could just have this as one return dict
Correct me if I'm wrong, but I think that would make setting weights more annoying?
That's true, but only slightly; the benefit of doing this is that different optimizers can use the DDPG evaluator (which is one of the main points of RLlib)
OK, do I also need to put the model and target_model weights in the same dictionary?
OK let's leave this as is and fix it in a later PR if necessary.
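For that possible later PR, a sketch of the single-dict version; set_weights and the key names are assumptions:

def get_weights(self):
    """Returns all weights in one dict, keyed by network."""
    return {
        "critic": self.critic_vars.get_weights(),
        "actor": self.actor_vars.get_weights(),
    }

def set_weights(self, weights):
    self.critic_vars.set_weights(weights["critic"])
    self.actor_vars.set_weights(weights["actor"])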
* master: (56 commits)
  [xray] Turn on flushing to the GCS for the lineage cache (ray-project#1907)
  Single Big Object Parallel Transfer. (ray-project#1827)
  Remove num_threads as a parameter. (ray-project#1891)
  Adds Valgrind tests for multi-threaded object manager. (ray-project#1890)
  Pin cython version in docker base dependencies file. (ray-project#1898)
  Update arrow to efficiently serialize more types of numpy arrays. (ray-project#1889)
  updates (ray-project#1896)
  [DataFrame] Inherit documentation from Pandas (ray-project#1727)
  Update arrow and parquet-cpp. (ray-project#1875)
  raylet command line resource configuration plumbing (ray-project#1882)
  use raylet for remote ray nodes (ray-project#1880)
  [rllib] Propagate dim option to deepmind wrappers (ray-project#1876)
  [RLLib] DDPG (ray-project#1685)
  Lint Python files with Yapf (ray-project#1872)
  [DataFrame] Fixed repr, info, and memory_usage (ray-project#1874)
  Fix getattr compat (ray-project#1871)
  check if arrow build dir exists (ray-project#1863)
  [DataFrame] Encapsulate index and lengths into separate class (ray-project#1849)
  [DataFrame] Implemented __getattr__ (ray-project#1753)
  Add better analytics to docs (ray-project#1854)
  ...

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/setup.py
Implemented DDPG Algorithm.