
[RLLib] DDPG #1685


Merged: 38 commits into ray-project:master on Apr 11, 2018

Conversation

@alvkao58 (Contributor) commented Mar 8, 2018

Implemented the DDPG algorithm.

"""Returns a batch of samples."""
# act in the environment, generate new samples
rollout = self.sampler.get_data()
samples = process_rollout(
Contributor

make sure observations + new_obs treatment is the same

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4210/

self.sess.run(tf.global_variables_initializer())

#TODO: (not critical) Add batch normalization?
def sample(self, no_replay = True):
Contributor

Is it possible to remove the replay handling from DDPG, and use SyncLocalReplayOptimizer / ApexOptimizer instead? Recently we refactored DQN to have replay be handled by the optimizer classes.
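
For context, the pattern being suggested looks roughly like the sketch below: the evaluator only samples fresh experience and computes/applies gradients, while the optimizer owns the replay buffer. Class and attribute names here are hypothetical stand-ins, not the actual RLlib optimizer API.

import random


class SimpleReplayOptimizer:
    """Hypothetical stand-in for an optimizer that owns replay, sketched for illustration."""

    def __init__(self, local_evaluator, buffer_size=10000, train_batch_size=64):
        self.local_evaluator = local_evaluator
        self.buffer = []  # plain list as a stand-in replay buffer
        self.buffer_size = buffer_size
        self.train_batch_size = train_batch_size

    def step(self):
        # Collect fresh experience; the evaluator never needs to know replay exists.
        self.buffer.extend(self.local_evaluator.sample())
        self.buffer = self.buffer[-self.buffer_size:]

        # Train on a random batch drawn from the buffer.
        batch = random.sample(self.buffer, min(len(self.buffer), self.train_batch_size))
        grads, _ = self.local_evaluator.compute_gradients(batch)
        self.local_evaluator.apply_gradients(grads)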

@@ -0,0 +1,197 @@
from __future__ import absolute_import
Contributor

Note: a copy of these classes is in ray.rllib.optimizers already

@@ -240,6 +240,9 @@ def get_agent_class(alg):
elif alg == "PG":
from ray.rllib import pg
return pg.PGAgent
elif alg == "DDPG":
from ray.rllib import ddpg
return ddpg.DDPGAgent
@ericl (Contributor) Mar 9, 2018

Nice. It would also be good to:

  • make an issue to update the docs; there are a couple of new algorithms not documented yet
  • add a DDPG example to the regression tests folder (tuned_examples/regression_tests)
  • add a DDPG sanity check to multi_node_tests.sh

@richardliaw (Contributor) Apr 10, 2018

@alvkao58 can you take care of these comments by Eric? (see PR)

Contributor Author

any particular sanity checks you want to see added to multi_node_tests.sh?

@richardliaw changed the title from "[RLLib] DDPG" to "[RLLib] (wip) DDPG" on Mar 9, 2018
@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4338/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4466/

def compute_gradients(self, samples):
""" Returns gradient w.r.t. samples."""
# actor gradients
actor_actions = self.sess.run(self.model.output_action,
Contributor

don't you track the actor samples in samples["actions"] or something?

Contributor Author

yeah, but looking at https://arxiv.org/pdf/1509.02971.pdf, aren't the actions here supposed to be whatever the actor currently outputs?

Contributor

ok yeah good point
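
For reference, the point settled above is that the actor update evaluates the critic at the actions the current actor would output for the batch observations, while the stored samples["actions"] only feed the critic's TD target. A minimal TF1-style sketch of that objective (toy layer sizes and names, not the PR's model code):

import tensorflow as tf

obs = tf.placeholder(tf.float32, [None, 3])          # observations from the replay batch

with tf.variable_scope("actor"):
    actor_out = tf.layers.dense(obs, 1, tf.nn.tanh)  # a = mu(s): the actor's *current* output
with tf.variable_scope("critic"):
    critic_q = tf.layers.dense(
        tf.concat([obs, actor_out], axis=1), 1)      # Q(s, mu(s)), not Q(s, a_stored)

actor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="actor")

# Deterministic policy gradient: maximize Q(s, mu(s)) w.r.t. the actor's weights only.
actor_loss = -tf.reduce_mean(critic_q)
actor_train_op = tf.train.AdamOptimizer(1e-4).minimize(actor_loss, var_list=actor_vars)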

@richardliaw (Contributor) left a comment

some comments

def _train(self):
self.optimizer.step()
# update target
self.local_evaluator.update_target()
Contributor

I think update_target is actually supposed to happen very often
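
For illustration, one way the training loop could apply that (a sketch with hypothetical names, not the PR's _train): the target networks get a small soft update after every optimizer step, which is how the tau-based update in the DDPG paper is intended to be used.

def _train(self):
    # Sketch only: "steps_per_iteration" is a hypothetical config key.
    for _ in range(self.config["steps_per_iteration"]):
        self.optimizer.step()
        # With a small tau, soft-updating the targets every step is cheap and stable.
        self.local_evaluator.update_target()
    return self._collect_training_result()  # hypothetical helper for stats/TrainingResult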


self.obs_and_actor = tf.concat([self.obs, self.output_action], 1) #output_action is output of actor network
with tf.variable_scope("critic", reuse=True):
self.cn_for_loss = FullyConnectedNetwork(self.obs_and_actor,
Contributor

Maybe let's try being a bit more explicit for now, just getting a specific action_gradient.

See self.action_grads in http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
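
The more explicit route referenced here typically looks like the following sketch (TF1 style, following the pattern in the linked post; names and layer sizes are illustrative): compute dQ/da at the actor's actions, then chain it through the actor's parameters.

import tensorflow as tf

obs = tf.placeholder(tf.float32, [None, 3])
act = tf.placeholder(tf.float32, [None, 1])            # actions fed in explicitly

with tf.variable_scope("actor"):
    actor_out = tf.layers.dense(obs, 1, tf.nn.tanh)
with tf.variable_scope("critic"):
    q_values = tf.layers.dense(tf.concat([obs, act], axis=1), 1)

actor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="actor")

# dQ/da, evaluated at whatever actions are fed into `act` (here: the actor's own outputs).
action_grads = tf.gradients(q_values, act)[0]

# Chain rule: feeding -dQ/da as grad_ys makes the actor ascend Q(s, mu(s)).
actor_grads = tf.gradients(actor_out, actor_vars, grad_ys=-action_grads)
apply_op = tf.train.AdamOptimizer(1e-4).apply_gradients(list(zip(actor_grads, actor_vars)))

At run time the actor's outputs for the batch would be computed first and then fed back in through act, which is roughly the two-step sess.run pattern in the compute_gradients snippet quoted above.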

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4472/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4473/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4474/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4548/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4578/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4586/

@alvkao58 (Contributor Author) commented Apr 1, 2018

Plot of average reward over last 10 episodes (episodes are 200 steps) for Pendulum-v0.
[image: learning graph]

The plot shows 80 episodes in total; the average reward starts noticeably increasing at around 60 episodes.

critic_grad = self.sess.run(self.critic_grads,
feed_dict=critic_feed_dict)

return (critic_grad, actor_grad), {}
Contributor

Let's move all the TensorFlow-specific things out of this class.

self.obs, self.output_action)

def _create_critic_network(self, obs, action):
net = tflearn.fully_connected(obs, 400)
Contributor

Let's move this back to slim and preferably as a Model we can import from ModelCatalog

@@ -20,7 +20,7 @@ class PartialRollout(object):
last_r (float): Value of next state. Used for bootstrapping.
"""

fields = ["observations", "actions", "rewards", "terminal", "features"]
fields = ["obs", "actions", "rewards", "new_obs", "dones", "features"]
Contributor

remind me again why we are changing terminal to dones?

Contributor Author

I did it for convenience because LocalSyncReplayOptimizer uses "obs", "new_obs", and "dones". I can change it back if necessary.

Contributor Author

I made this consistent across all algorithms that use this sampler.


with tf.variable_scope("model"):
self.model = DDPGModel(self.registry,
self.env,
Contributor

this ends up being a little verbose; consider

self.model = DDPGModel(
    self.registry, self.env, ... )

Contributor Author

I think that if I don't do this, the model and target_model end up sharing weights?
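
For context, a small sketch of the behavior being described (illustrative, not the PR's code): building the online and target models under distinct variable scopes is what keeps their weights separate, since constructing them in the same reused scope would make them share variables.

import tensorflow as tf

def build_net(obs):
    # Toy network standing in for the DDPG actor/critic graphs.
    return tf.layers.dense(obs, 1, name="fc")

obs = tf.placeholder(tf.float32, [None, 3])

with tf.variable_scope("model"):
    out = build_net(obs)           # online network
with tf.variable_scope("target_model"):
    target_out = build_net(obs)    # different scope -> its own, independent weights

model_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="model/")
target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="target_model/")
assert not set(model_vars) & set(target_vars)   # nothing shared between the two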

for s in stats:
mean_10ep_reward += s["mean_10ep_reward"] / len(stats)
mean_10ep_length += s["mean_10ep_length"] / len(stats)
num_episodes += s["num_episodes"]
Contributor

would you want to keep this as a running average?

Contributor Author

Is episode_reward_mean in TrainingResult meant to give an average from the start of time?

Contributor

It will give an average over the number of episodes finished since the last TrainingResult.

I think here we should just do a lot of optimizer steps and return a TrainingResult only after a while.

@@ -0,0 +1,210 @@
# imports
Contributor

I think we should move all of the TensorFlow stuff out of this class; it would make it a lot easier for us to support PyTorch.

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4617/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4618/

@alvkao58 (Contributor Author) commented Apr 3, 2018

retest please

@richardliaw (Contributor)

retest this please

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4635/

@alvkao58 (Contributor Author) commented Apr 5, 2018

Sample reward graph after changes.

[image: reward graph]

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4659/

weights_initializer=w_normal)
net = tf.nn.relu(tf.add(t1, t2))

out = slim.fully_connected(net,
Contributor

** A note throughout: can we avoid this style of indentation, and change it to

net = slim.fully_connected(
    obs, 400, activation_fn=tf.nn.relu, weights_initializer=w_normal)


def _create_critic_network(self, obs, action):
"""Network for critic."""
w_normal = tf.truncated_normal_initializer()
Contributor

can we move this into the models directory?

def _setup_target_updates(self):
"""Set up target actor and critic updates."""
a_updates = []
for var, target_var in zip(self.model.actor_var_list,
Contributor

formatting for this is also a little odd; would be great to fix this.

You can reduce verbosity by just setting tau = self.config["tau"]
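
For reference, the soft update this function builds usually reduces to something like the sketch below (tau pulled out once as suggested; the attribute names in the usage comment mirror the code quoted above and are otherwise assumptions, e.g. target_model and critic_var_list):

import tensorflow as tf

def make_soft_update_op(model_vars, target_vars, tau):
    """One op that nudges each target variable toward its online counterpart:
    target <- tau * online + (1 - tau) * target, with tau typically around 0.001."""
    updates = [
        target_var.assign(tau * var + (1.0 - tau) * target_var)
        for var, target_var in zip(model_vars, target_vars)
    ]
    return tf.group(*updates)

# Usage sketch:
#   tau = self.config["tau"]
#   self.target_update_op = make_soft_update_op(
#       self.model.actor_var_list + self.model.critic_var_list,
#       self.target_model.actor_var_list + self.target_model.critic_var_list, tau)
#   self.sess.run(self.target_update_op)   # run after each (or every few) optimizer step(s)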

self.sess = tf.Session()

with tf.variable_scope("model"):
self.model = DDPGActorCritic(registry,
Contributor

see ** a note throughout

@richardliaw (Contributor)

The graph is without batchnorm?

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4701/

@alvkao58 (Contributor Author) commented Apr 7, 2018

retest this please

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4722/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4738/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4739/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4742/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4743/

@alvkao58 (Contributor Author) commented Apr 9, 2018

retest this please

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4748/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4750/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4762/

@alvkao58 (Contributor Author)

retest this please

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4769/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4770/

@ray-project deleted a comment from alvkao58 on Apr 10, 2018
@richardliaw (Contributor) left a comment

OK this is looking really good! We have one last thing, which is to add this to our test suite in multi_node_tests.sh and add an example to the regression tests folder.

Otherwise, this is ready to merge (and will probably be done after #1868)


samples = process_rollout(
rollout, NoFilter(),
gamma=1.0, use_gae=False)
Contributor

nit: we should add a comment explaining why this gamma is used here (or whether it should just be the gamma from the config)

episode_lengths.append(episode.episode_length)
episode_rewards.append(episode.episode_reward)
avg_reward = (
np.mean(episode_rewards) if episode_rewards else float('nan'))
Contributor

nit: I believe you can just do np.mean(episode_rewards) and np.sum (below) to the same effect, without the if-else cases
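
For what it's worth, numpy already behaves the way this nit relies on: np.sum of an empty list is 0.0 and np.mean of an empty list is nan (it does emit a "Mean of empty slice" RuntimeWarning), so the explicit guards are redundant.

import numpy as np

np.sum([])    # 0.0
np.mean([])   # nan (with a "Mean of empty slice" RuntimeWarning)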

# Number of local steps taken for each call to sample
"num_local_steps": 1,
# Number of workers (excluding master)
"num_workers": 1,
Contributor

what happens when num_workers is 0? does the algorithm still output anything?

Contributor Author

As of now, it doesn't output anything when num_workers is 0, but I could update the stats so it takes metrics from the local evaluator when there are no remote evaluators.

Contributor

OK let's do that
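
A minimal version of that fallback could look like the sketch below (illustrative; the stats() method name mirrors the per-evaluator stats used elsewhere in this PR and is otherwise an assumption):

import ray

def collect_stats(local_evaluator, remote_evaluators):
    """Gather per-evaluator stats, falling back to the local evaluator when
    num_workers == 0 so metrics are still reported."""
    if remote_evaluators:
        return ray.get([ev.stats.remote() for ev in remote_evaluators])
    return [local_evaluator.stats()]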


def get_weights(self):
"""Returns critic weights, actor weights."""
return self.critic_vars.get_weights(), self.actor_vars.get_weights()
Contributor

nit: it would be great if we could just have this as one return dict

Contributor Author

Correct me if I'm wrong, but I think that would make setting weights more annoying?

Contributor

That's true, but only slightly; the benefit of doing this is that different optimizers can use the DDPG evaluator (which is one of the main points of RLlib)
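
For illustration, the single-dict shape being suggested might look like the sketch below (keeping the critic_vars/actor_vars helpers from the quoted code and assuming they also expose set_weights; whether target weights belong in the same dict is exactly the question raised just below):

def get_weights(self):
    """Return all weights in one dict, so any optimizer can shuttle them around opaquely."""
    return {
        "critic": self.critic_vars.get_weights(),
        "actor": self.actor_vars.get_weights(),
    }

def set_weights(self, weights):
    self.critic_vars.set_weights(weights["critic"])
    self.actor_vars.set_weights(weights["actor"])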

Contributor Author

OK, do I also need to put the model and target_model weights in the same dictionary?

Contributor

OK let's leave this as is and fix it in a later PR if necessary.


@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4799/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4800/

@richardliaw changed the title from "[RLLib] (wip) DDPG" to "[RLLib] DDPG" on Apr 11, 2018
@richardliaw merged commit 15a668d into ray-project:master on Apr 11, 2018
royf added a commit to royf/ray that referenced this pull request Apr 22, 2018
* master: (56 commits)
  [xray] Turn on flushing to the GCS for the lineage cache (ray-project#1907)
  Single Big Object Parallel Transfer. (ray-project#1827)
  Remove num_threads as a parameter. (ray-project#1891)
  Adds Valgrind tests for multi-threaded object manager. (ray-project#1890)
  Pin cython version in docker base dependencies file. (ray-project#1898)
  Update arrow to efficiently serialize more types of numpy arrays. (ray-project#1889)
  updates (ray-project#1896)
  [DataFrame] Inherit documentation from Pandas (ray-project#1727)
  Update arrow and parquet-cpp. (ray-project#1875)
  raylet command line resource configuration plumbing (ray-project#1882)
  use raylet for remote ray nodes (ray-project#1880)
  [rllib] Propagate dim option to deepmind wrappers (ray-project#1876)
  [RLLib] DDPG (ray-project#1685)
  Lint Python files with Yapf (ray-project#1872)
  [DataFrame] Fixed repr, info, and memory_usage (ray-project#1874)
  Fix getattr compat (ray-project#1871)
  check if arrow build dir exists (ray-project#1863)
  [DataFrame] Encapsulate index and lengths into separate class (ray-project#1849)
  [DataFrame] Implemented __getattr__ (ray-project#1753)
  Add better analytics to docs (ray-project#1854)
  ...

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/setup.py