
[RLLib] DDPG #1685


Merged: 38 commits into ray-project:master on Apr 11, 2018

Conversation

@alvkao58 (Contributor) commented Mar 8, 2018

Implemented the DDPG algorithm.

"""Returns a batch of samples."""
# act in the environment, generate new samples
rollout = self.sampler.get_data()
samples = process_rollout(
Contributor

make sure observations + new_obs treatment is the same

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4210/

self.sess.run(tf.global_variables_initializer())

#TODO: (not critical) Add batch normalization?
def sample(self, no_replay = True):
Contributor

Is it possible to remove the replay handling from DDPG, and use SyncLocalReplayOptimizer / ApexOptimizer instead? Recently we refactored DQN to have replay be handled by the optimizer classes.
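
For context, the pattern being suggested looks roughly like the sketch below: the evaluator only samples fresh experience and computes/applies gradients, while the optimizer owns the replay buffer. Class and attribute names here are hypothetical stand-ins, not the actual RLlib optimizer API.

import random


class SimpleReplayOptimizer:
    """Hypothetical stand-in for an optimizer that owns replay, sketched for illustration."""

    def __init__(self, local_evaluator, buffer_size=10000, train_batch_size=64):
        self.local_evaluator = local_evaluator
        self.buffer = []  # plain list as a stand-in replay buffer
        self.buffer_size = buffer_size
        self.train_batch_size = train_batch_size

    def step(self):
        # Collect fresh experience; the evaluator never needs to know replay exists.
        self.buffer.extend(self.local_evaluator.sample())
        self.buffer = self.buffer[-self.buffer_size:]

        # Train on a random batch drawn from the buffer.
        batch = random.sample(self.buffer, min(len(self.buffer), self.train_batch_size))
        grads, _ = self.local_evaluator.compute_gradients(batch)
        self.local_evaluator.apply_gradients(grads)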

@@ -0,0 +1,197 @@
from __future__ import absolute_import
Contributor

Note: a copy of these classes is in ray.rllib.optimizers already

@@ -240,6 +240,9 @@ def get_agent_class(alg):
elif alg == "PG":
from ray.rllib import pg
return pg.PGAgent
elif alg == "DDPG":
from ray.rllib import ddpg
return ddpg.DDPGAgent
@ericl (Contributor) Mar 9, 2018

Nice. It would also be good to:

  • make an issue to update the docs; there are a couple of new algorithms not documented yet
  • add a DDPG example to the regression tests folder (tuned_examples/regression_tests)
  • add a DDPG sanity check to multi_node_tests.sh

@richardliaw (Contributor) Apr 10, 2018

@alvkao58 can you take care of these comments by Eric? (see PR)

Contributor Author

any particular sanity checks you want to see added to multi_node_tests.sh?

@richardliaw changed the title from "[RLLib] DDPG" to "[RLLib] (wip) DDPG" on Mar 9, 2018
@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4338/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4466/

def compute_gradients(self, samples):
""" Returns gradient w.r.t. samples."""
# actor gradients
actor_actions = self.sess.run(self.model.output_action,
Contributor

don't you track the actor samples in samples["actions"] or something?

Contributor Author

yeah, but looking at https://arxiv.org/pdf/1509.02971.pdf, aren't the actions here supposed to be whatever the actor currently outputs?

Contributor

ok yeah good point
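
For reference, the point settled above is that the actor update evaluates the critic at the actions the current actor would output for the batch observations, while the stored samples["actions"] only feed the critic's TD target. A minimal TF1-style sketch of that objective (toy layer sizes and names, not the PR's model code):

import tensorflow as tf

obs = tf.placeholder(tf.float32, [None, 3])          # observations from the replay batch

with tf.variable_scope("actor"):
    actor_out = tf.layers.dense(obs, 1, tf.nn.tanh)  # a = mu(s): the actor's *current* output
with tf.variable_scope("critic"):
    critic_q = tf.layers.dense(
        tf.concat([obs, actor_out], axis=1), 1)      # Q(s, mu(s)), not Q(s, a_stored)

actor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="actor")

# Deterministic policy gradient: maximize Q(s, mu(s)) w.r.t. the actor's weights only.
actor_loss = -tf.reduce_mean(critic_q)
actor_train_op = tf.train.AdamOptimizer(1e-4).minimize(actor_loss, var_list=actor_vars)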

@richardliaw (Contributor) left a comment

some comments

def _train(self):
self.optimizer.step()
# update target
self.local_evaluator.update_target()
Contributor

I think update_target is actually supposed to happen very often
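
For illustration, one way the training loop could apply that (a sketch with hypothetical names, not the PR's _train): the target networks get a small soft update after every optimizer step, which is how the tau-based update in the DDPG paper is intended to be used.

def _train(self):
    # Sketch only: "steps_per_iteration" is a hypothetical config key.
    for _ in range(self.config["steps_per_iteration"]):
        self.optimizer.step()
        # With a small tau, soft-updating the targets every step is cheap and stable.
        self.local_evaluator.update_target()
    return self._collect_training_result()  # hypothetical helper for stats/TrainingResult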


self.obs_and_actor = tf.concat([self.obs, self.output_action], 1) #output_action is output of actor network
with tf.variable_scope("critic", reuse=True):
self.cn_for_loss = FullyConnectedNetwork(self.obs_and_actor,
Contributor

Maybe let's try being a bit more explicit for now, just getting a specific action_gradient.

See self.action_grads in http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
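
The more explicit route referenced here typically looks like the following sketch (TF1 style, following the pattern in the linked post; names and layer sizes are illustrative): compute dQ/da at the actor's actions, then chain it through the actor's parameters.

import tensorflow as tf

obs = tf.placeholder(tf.float32, [None, 3])
act = tf.placeholder(tf.float32, [None, 1])            # actions fed in explicitly

with tf.variable_scope("actor"):
    actor_out = tf.layers.dense(obs, 1, tf.nn.tanh)
with tf.variable_scope("critic"):
    q_values = tf.layers.dense(tf.concat([obs, act], axis=1), 1)

actor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="actor")

# dQ/da, evaluated at whatever actions are fed into `act` (here: the actor's own outputs).
action_grads = tf.gradients(q_values, act)[0]

# Chain rule: feeding -dQ/da as grad_ys makes the actor ascend Q(s, mu(s)).
actor_grads = tf.gradients(actor_out, actor_vars, grad_ys=-action_grads)
apply_op = tf.train.AdamOptimizer(1e-4).apply_gradients(list(zip(actor_grads, actor_vars)))

At run time the actor's outputs for the batch would be computed first and then fed back in through act, which is roughly the two-step sess.run pattern in the compute_gradients snippet quoted above.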

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4472/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4473/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4474/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4548/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4578/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4586/

@alvkao58 (Contributor Author) commented Apr 1, 2018

Plot of average reward over last 10 episodes (episodes are 200 steps) for Pendulum-v0.
[image: learning graph]

The plot shows 80 episodes in total; the average reward starts noticeably increasing at around 60 episodes.

critic_grad = self.sess.run(self.critic_grads,
feed_dict=critic_feed_dict)

return (critic_grad, actor_grad), {}
Contributor

Let's move all the TensorFlow-specific things out of this class.

self.obs, self.output_action)

def _create_critic_network(self, obs, action):
net = tflearn.fully_connected(obs, 400)
Contributor

Let's move this back to slim and preferably as a Model we can import from ModelCatalog

@@ -20,7 +20,7 @@ class PartialRollout(object):
last_r (float): Value of next state. Used for bootstrapping.
"""

fields = ["observations", "actions", "rewards", "terminal", "features"]
fields = ["obs", "actions", "rewards", "new_obs", "dones", "features"]
Contributor

remind me again why we are changing terminal to dones?

Contributor Author

I did it for convenience because LocalSyncReplayOptimizer uses "obs", "new_obs", and "dones". I can change it back if necessary.

Contributor Author

I made this consistent across all algorithms that use this sampler.


with tf.variable_scope("model"):
self.model = DDPGModel(self.registry,
self.env,
Contributor

this ends up being a little verbose; consider

self.model = DDPGModel(
    self.registry, self.env, ... )

Contributor Author

I think that if I don't do this, the model and target_model end up sharing weights?
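
For context, a small sketch of the behavior being described (illustrative, not the PR's code): building the online and target models under distinct variable scopes is what keeps their weights separate, since constructing them in the same reused scope would make them share variables.

import tensorflow as tf

def build_net(obs):
    # Toy network standing in for the DDPG actor/critic graphs.
    return tf.layers.dense(obs, 1, name="fc")

obs = tf.placeholder(tf.float32, [None, 3])

with tf.variable_scope("model"):
    out = build_net(obs)           # online network
with tf.variable_scope("target_model"):
    target_out = build_net(obs)    # different scope -> its own, independent weights

model_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="model/")
target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="target_model/")
assert not set(model_vars) & set(target_vars)   # nothing shared between the two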

for s in stats:
mean_10ep_reward += s["mean_10ep_reward"] / len(stats)
mean_10ep_length += s["mean_10ep_length"] / len(stats)
num_episodes += s["num_episodes"]
Contributor

would you want to keep this as a running average?

Contributor Author

Is episode_reward_mean in TrainingResult meant to give an average from the start of time?

Contributor

It will give an average over the number of episodes finished since the last TrainingResult.

I think here we should just do a lot of optimizer steps and return a TrainingResult only after a while.

@@ -0,0 +1,210 @@
# imports
Contributor

I think we should move all of the TensorFlow stuff out of this class; it would make it a lot easier for us to support PyTorch.

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4617/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4618/

@alvkao58 (Contributor Author) commented Apr 3, 2018

retest please

@richardliaw (Contributor)

retest this please

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4635/

@alvkao58 (Contributor Author) commented Apr 5, 2018

Sample reward graph after changes.

[image: reward graph]

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4659/

weights_initializer=w_normal)
net = tf.nn.relu(tf.add(t1, t2))

out = slim.fully_connected(net,
Contributor

** A note throughout: can we avoid this style of indentation, and change it to

net = slim.fully_connected(
    obs, 400, activation_fn=tf.nn.relu, weights_initializer=w_normal)


def _create_critic_network(self, obs, action):
"""Network for critic."""
w_normal = tf.truncated_normal_initializer()
Contributor

can we move this into the models directory?

def _setup_target_updates(self):
"""Set up target actor and critic updates."""
a_updates = []
for var, target_var in zip(self.model.actor_var_list,
Contributor

formatting for this is also a little odd; would be great to fix this.

You can reduce verbosity by just setting tau = self.config["tau"]
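
For reference, the soft update this function builds usually reduces to something like the sketch below (tau pulled out once as suggested; the attribute names in the usage comment mirror the code quoted above and are otherwise assumptions, e.g. target_model and critic_var_list):

import tensorflow as tf

def make_soft_update_op(model_vars, target_vars, tau):
    """One op that nudges each target variable toward its online counterpart:
    target <- tau * online + (1 - tau) * target, with tau typically around 0.001."""
    updates = [
        target_var.assign(tau * var + (1.0 - tau) * target_var)
        for var, target_var in zip(model_vars, target_vars)
    ]
    return tf.group(*updates)

# Usage sketch:
#   tau = self.config["tau"]
#   self.target_update_op = make_soft_update_op(
#       self.model.actor_var_list + self.model.critic_var_list,
#       self.target_model.actor_var_list + self.target_model.critic_var_list, tau)
#   self.sess.run(self.target_update_op)   # run after each (or every few) optimizer step(s)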

self.sess = tf.Session()

with tf.variable_scope("model"):
self.model = DDPGActorCritic(registry,
Contributor

see ** a note throughout

@richardliaw (Contributor)

The graph is without batchnorm?

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4701/

@alvkao58 (Contributor Author) commented Apr 7, 2018

retest this please

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4722/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4738/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4739/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4742/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4743/

@alvkao58 (Contributor Author) commented Apr 9, 2018

retest this please

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4748/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4750/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4762/

@alvkao58 (Contributor Author)

retest this please

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4769/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4770/

@ray-project deleted a comment from alvkao58 on Apr 10, 2018
@richardliaw (Contributor) left a comment

OK this is looking really good! We have one last thing, which is to add this to our test suite in multi_node_tests.sh and add an example to the regression tests folder.

Otherwise, this is ready to merge (and will probably be done after #1868)


samples = process_rollout(
rollout, NoFilter(),
gamma=1.0, use_gae=False)
Contributor

nit: we should add a comment explaining why this gamma is used here (or whether it should just be the gamma from the config)

episode_lengths.append(episode.episode_length)
episode_rewards.append(episode.episode_reward)
avg_reward = (
np.mean(episode_rewards) if episode_rewards else float('nan'))
Contributor

nit: I believe you can just do np.mean(episode_rewards) and np.sum (below) to the same effect, without the if-else cases
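
For what it's worth, numpy already behaves the way this nit relies on: np.sum of an empty list is 0.0 and np.mean of an empty list is nan (it does emit a "Mean of empty slice" RuntimeWarning), so the explicit guards are redundant.

import numpy as np

np.sum([])    # 0.0
np.mean([])   # nan (with a "Mean of empty slice" RuntimeWarning)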

# Number of local steps taken for each call to sample
"num_local_steps": 1,
# Number of workers (excluding master)
"num_workers": 1,
Contributor

what happens when num_workers is 0? does the algorithm still output anything?

Contributor Author

As of now, it doesn't output anything when num_workers is 0, but I could update the stats so it takes metrics from the local evaluator when there are no remote evaluators.

Contributor

OK let's do that
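
A minimal version of that fallback could look like the sketch below (illustrative; the stats() method name mirrors the per-evaluator stats used elsewhere in this PR and is otherwise an assumption):

import ray

def collect_stats(local_evaluator, remote_evaluators):
    """Gather per-evaluator stats, falling back to the local evaluator when
    num_workers == 0 so metrics are still reported."""
    if remote_evaluators:
        return ray.get([ev.stats.remote() for ev in remote_evaluators])
    return [local_evaluator.stats()]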


def get_weights(self):
"""Returns critic weights, actor weights."""
return self.critic_vars.get_weights(), self.actor_vars.get_weights()
Contributor

nit: it would be great if we could just have this as one return dict

Contributor Author

Correct me if I'm wrong, but I think that would make setting weights more annoying?

Contributor

That's true, but only slightly; the benefit of doing this is that different optimizers can use the DDPG evaluator (which is one of the main points of RLlib)
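
For illustration, the single-dict shape being suggested might look like the sketch below (keeping the critic_vars/actor_vars helpers from the quoted code and assuming they also expose set_weights; whether target weights belong in the same dict is exactly the question raised just below):

def get_weights(self):
    """Return all weights in one dict, so any optimizer can shuttle them around opaquely."""
    return {
        "critic": self.critic_vars.get_weights(),
        "actor": self.actor_vars.get_weights(),
    }

def set_weights(self, weights):
    self.critic_vars.set_weights(weights["critic"])
    self.actor_vars.set_weights(weights["actor"])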

Contributor Author

OK, do I also need to put the model and target_model weights in the same dictionary?

Contributor

OK let's leave this as is and fix it in a later PR if necessary.


@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4799/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4800/

@richardliaw changed the title from "[RLLib] (wip) DDPG" to "[RLLib] DDPG" on Apr 11, 2018
@richardliaw merged commit 15a668d into ray-project:master on Apr 11, 2018
royf added a commit to royf/ray that referenced this pull request Apr 22, 2018
* master: (56 commits)
  [xray] Turn on flushing to the GCS for the lineage cache (ray-project#1907)
  Single Big Object Parallel Transfer. (ray-project#1827)
  Remove num_threads as a parameter. (ray-project#1891)
  Adds Valgrind tests for multi-threaded object manager. (ray-project#1890)
  Pin cython version in docker base dependencies file. (ray-project#1898)
  Update arrow to efficiently serialize more types of numpy arrays. (ray-project#1889)
  updates (ray-project#1896)
  [DataFrame] Inherit documentation from Pandas (ray-project#1727)
  Update arrow and parquet-cpp. (ray-project#1875)
  raylet command line resource configuration plumbing (ray-project#1882)
  use raylet for remote ray nodes (ray-project#1880)
  [rllib] Propagate dim option to deepmind wrappers (ray-project#1876)
  [RLLib] DDPG (ray-project#1685)
  Lint Python files with Yapf (ray-project#1872)
  [DataFrame] Fixed repr, info, and memory_usage (ray-project#1874)
  Fix getattr compat (ray-project#1871)
  check if arrow build dir exists (ray-project#1863)
  [DataFrame] Encapsulate index and lengths into separate class (ray-project#1849)
  [DataFrame] Implemented __getattr__ (ray-project#1753)
  Add better analytics to docs (ray-project#1854)
  ...

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/setup.py