TRPO agent #204
Conversation
Line search always backtracks several times. I need to check whether this is normal behavior or something is wrong in the current implementation.
I noticed 3.1.0 adds double-backprop support for relevant functions such as softplus and log, which 3.0.0 lacks. This explains the different TRPO results between 3.0.0 and 3.1.0.
I think it should raise an error if the computation contains functions that don't support double-backprop, but I am not sure how to detect them.
I added a TODO about double-backprop.
If update_interval=50 and the length of training is 50, TRPO didn't update the model at all before this change.
python examples/gym/train_trpo_gym.py --env Hopper-v1 --steps 2000000 --eval-interval 100000 --eval-n-runs 100
python examples/gym/train_trpo_gym.py --env Walker2d-v1 --steps 2000000 --eval-interval 100000 --eval-n-runs 100
While these are single runs from random seed 0, their performance looks better than that reported in the PPO paper (http://arxiv.org/abs/1707.06347) and comparable to http://arxiv.org/abs/1709.06560 as well.
Thanks. I reviewed.
chainerrl/agents/trpo.py (Outdated)
_is_double_backprop_supported = (
    StrictVersion(chainer.__version__) >= StrictVersion('3.0.0'))
rc prereleases of Chainer will fail to be parsed by StrictVersion.
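A minimal sketch of one way around that, assuming the check only needs to gate on the Chainer version: LooseVersion parses prerelease strings that StrictVersion rejects (packaging.version.parse would order prereleases even more precisely).

```python
# Sketch only: LooseVersion tolerates prerelease version strings such as
# release candidates, which StrictVersion raises ValueError on.
from distutils.version import LooseVersion

import chainer

_is_double_backprop_supported = (
    LooseVersion(chainer.__version__) >= LooseVersion('3.0.0'))
```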
chainerrl/agents/trpo.py (Outdated)
        break
    step_size *= 0.5
else:
    self.logger.info("""\
If there is a convention that a log message is a single line, it might be better to use plain " strings instead of triple-quoted ones:
self.logger.info("\
foo bar.")
or
self.logger.info(
    "foo"
    " bar."
)
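For illustration, a hedged sketch of the backtracking loop logging a single-line message in the concatenated-string style suggested above; accept, apply_step, and max_backtrack are hypothetical names, not the PR's actual API.

```python
import logging

logger = logging.getLogger(__name__)


def backtracking_line_search(full_step, accept, apply_step, max_backtrack=10):
    """Halve the step size until accept() is satisfied (hypothetical callbacks)."""
    step_size = 1.0
    for _ in range(max_backtrack):
        if accept(step_size, full_step):
            apply_step(step_size, full_step)
            break
        step_size *= 0.5
    else:
        # A single-line message is easier to grep than a triple-quoted block.
        logger.info(
            "Line search could not find a step size satisfying the"
            " constraints; the policy parameters were left unchanged.")
```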
chainerrl/agents/trpo.py (Outdated)
dataset_iter = chainer.iterators.SerialIterator(
    dataset, self.vf_batch_size)

dataset_iter.reset()
reset() is already done in the initializer of SerialIterator, so this call is redundant.
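A quick illustration with a placeholder dataset: the iterator is usable immediately after construction, with no explicit reset().

```python
import chainer

dataset = list(range(100))  # placeholder dataset
it = chainer.iterators.SerialIterator(dataset, batch_size=32)
while it.epoch < 1:
    batch = it.next()  # no it.reset() needed; batch would feed the value-function update
```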
chainerrl/misc/conjugate_gradient.py (Outdated)
r0 = b - A_product_func(x)
p = r0
for i in range(max_iter):
    a = xp.dot(r0.T, r0) / xp.dot(A_product_func(p).T, p)
.T has no effect since r0 and A_product_func(p) are 1-dim vectors.
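For reference, a hedged sketch of conjugate gradient over plain 1-D NumPy vectors, where ordinary dot products suffice and .T would be a no-op; this is illustrative, not the PR's exact implementation.

```python
import numpy as np


def conjugate_gradient(A_product_func, b, max_iter=10, tol=1e-10):
    """Solve A x = b given only a function computing A-vector products."""
    x = np.zeros_like(b)
    r = b - A_product_func(x)
    p = r.copy()
    rs_old = np.dot(r, r)
    for _ in range(max_iter):
        Ap = A_product_func(p)
        alpha = rs_old / np.dot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = np.dot(r, r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```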
examples/gym/train_trpo_gym.py (Outdated)
mean_wscale=0.01,
nonlinearity=F.tanh,
var_type='diagonal',
var_func=lambda x: F.exp(x) ** 2,  # Parameterize log std
F.exp(2 * x) could be faster.
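A side-by-side of the two parameterizations, which are mathematically identical since exp(x)² = exp(2x):

```python
import chainer.functions as F

var_func_original = lambda x: F.exp(x) ** 2   # exponentiate, then square
var_func_suggested = lambda x: F.exp(2 * x)   # same variance, one elementwise op fewer
```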
@testing.parameterize(
    *testing.product({
        'n': [1, 5],
    })
Could you add a float32 test (with a larger tol)?
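A hedged sketch of how the parameterization could also cover dtypes with a looser tolerance for float32; the class name, test body, and the imported solver path/signature are assumptions, not the PR's actual test.

```python
import unittest

import numpy as np
from chainer import testing

from chainerrl.misc.conjugate_gradient import conjugate_gradient  # assumed import path


@testing.parameterize(*testing.product({
    'n': [1, 5],
    'dtype': [np.float32, np.float64],
}))
class TestConjugateGradientDtype(unittest.TestCase):

    def test_solves_spd_system(self):
        tol = 1e-3 if self.dtype == np.float32 else 1e-6
        A = np.random.rand(self.n, self.n).astype(self.dtype)
        A = A.dot(A.T) + self.n * np.eye(self.n, dtype=self.dtype)  # make it SPD
        b = np.random.rand(self.n).astype(self.dtype)
        x = conjugate_gradient(lambda v: A.dot(v), b)  # assumed signature
        np.testing.assert_allclose(A.dot(x), b, atol=tol, rtol=tol)
```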
chainerrl/agents/trpo.py (Outdated)
def _hessian_vector_product(flat_grads, params, vec):
    """Compute hessian vector product efficiently by backprop."""
    grads = _chainer_grad_with_zero([F.sum(flat_grads * vec)], params)
Assuming all the parameters are used, chainer.grad(outputs, inputs, *args, **kwargs) just works.
Right. We cannot assume that for a general Hessian-vector product, but it's true that TRPO doesn't work with unused parameters because of CG. I think using chainer.grad and raising an informative error when there's a None would be better. I'll fix it.
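A hedged sketch of the fix described here: call chainer.grad directly and raise an informative error when a parameter receives no gradient; _flatten_and_concat_variables is a hypothetical helper, not the PR's code.

```python
import chainer
import chainer.functions as F


def _flatten_and_concat_variables(vs):
    # Hypothetical helper: flatten each gradient and join them into one vector.
    return F.concat([F.flatten(v) for v in vs], axis=0)


def _hessian_vector_product(flat_grads, params, vec):
    """Compute a Hessian-vector product by double backprop."""
    grads = chainer.grad([F.sum(flat_grads * vec)], params)
    if any(g is None for g in grads):
        raise RuntimeError(
            'Some parameters did not receive gradients. TRPO requires every'
            ' parameter to affect the policy objective.')
    return _flatten_and_concat_variables(grads).data
```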
tests/agents_tests/test_trpo.py (Outdated)
hessian = compute_hessian(y, params)
self.assertEqual(np.count_nonzero(hvp), 0)
self.assertEqual(np.count_nonzero(hessian), 0)
np.testing.assert_allclose(hvp, hessian.dot(vec), atol=1e-3)
This line seems equivalent to just checking shapes, because both sides are all zeros (0 × 0 = 0).
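A hedged sketch of a complementary, non-vacuous check using an objective with a nonzero Hessian; compute_hessian_vector_product and compute_hessian stand in for the helpers used by the test above (assumed names and signatures).

```python
import chainer
import chainer.functions as F
import numpy as np


def test_nonzero_hessian_vector_product():
    param = chainer.Parameter(np.random.rand(3).astype(np.float64))
    vec = np.random.rand(3)
    y = F.sum(param ** 2)  # Hessian is 2 * I, so the comparison is non-trivial
    hvp = compute_hessian_vector_product(y, [param], vec)  # assumed helper
    hessian = compute_hessian(y, [param])                  # assumed helper
    assert np.count_nonzero(hessian) > 0
    np.testing.assert_allclose(hvp, hessian.dot(vec), atol=1e-3)
```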
chainerrl/agents/trpo.py (Outdated)
self.policy_step_size_record = collections.deque(
    maxlen=policy_step_size_stats_window)

self.xp = self.policy.xp
If it's allowed to put policy and vf on different devices (CPU/GPU), self.xp might be confusing.
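A small sketch of the alternative implied here, with assumed attribute names: query each model for its own array module instead of caching a single self.xp.

```python
def policy_xp(agent):
    return agent.policy.xp  # numpy or cupy, depending on the policy's device


def vf_xp(agent):
    return agent.vf.xp  # may differ if the value function sits on another device
```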
nonlinearity=F.relu,
mean_wscale=1,
var_func=F.softplus,
var_param_init=0,
Do you think the name var_param_init suggests the correspondence to the names var_wscale and var_bias of FCGaussianPolicy?
Maybe var is more consistent with var_bias, while less informative. Do you think var is better?
Actually, this param does not represent the variance itself; it represents values that are converted to the variance via var_func, so I added _param. I admit it is still confusing, but I didn't come up with a better name. Any suggestions?
I fixed the points you mentioned, except the name of var_param_init.
The coverage with Chainer v2 decreased while that of v3 increased. I think this is because the TRPO tests are active only for v3.
LGTM
This PR adds chainerrl.agents.TRPO, which implements the TRPO-GAE algorithm. Resolves #202
Strangely, chainer==3.0.0 and chainer==3.1.0 give different results. With chainer==3.0.0 line search always backtracks several times, while with chainer==3.1.0 it rarely backtracks; the values of the expected improvement and the KL divergence differ. This problem is solved: the current TRPO only works with 3.1.0 or later.
python examples/gym/train_trpo_gym.py --env Hopper-v1 --gpu -1
chainer==3.0.0
chainer==3.1.0