
TRPO agent #204
Merged: 32 commits, Mar 15, 2018

Conversation

muupan (Member) commented Dec 16, 2017

This PR adds chainerrl.agents.TRPO, which implements the TRPO-GAE algorithm.

Resolves #202

Strangely, chainer==3.0.0 and chainer==3.1.0 give different results. With chainer==3.0.0 the line search always backtracks several times, while with chainer==3.1.0 it rarely backtracks; the values of the expected improvement and the KL divergence differ as well. This problem has since been resolved: the current TRPO only works with chainer 3.1.0 or later.

python examples/gym/train_trpo_gym.py --env Hopper-v1 --gpu -1

chainer==3.0.0

INFO:chainerrl.experiments.train_agent:outdir:results/20171217T050456.738713 step:1022 episode:48 R:9.23186813481
INFO:chainerrl.experiments.train_agent:statistics:[('average_value', nan), ('average_entropy', 3.7070460319519043), ('average_kl', nan), ('average_policy_step_size', nan)]
INFO:chainerrl.agents.trpo:Line search iteration: 0 step size: 1.0
INFO:chainerrl.agents.trpo:Surrogate objective improve: 0.13816076517105103
INFO:chainerrl.agents.trpo:KL divergence: 0.8066530227661133
INFO:chainerrl.agents.trpo:KL divergence exceeds max_kl. Bakctracking...
INFO:chainerrl.agents.trpo:Line search iteration: 1 step size: 0.5
INFO:chainerrl.agents.trpo:Surrogate objective improve: 0.08148320019245148
INFO:chainerrl.agents.trpo:KL divergence: 0.20232915878295898
INFO:chainerrl.agents.trpo:KL divergence exceeds max_kl. Bakctracking...
INFO:chainerrl.agents.trpo:Line search iteration: 2 step size: 0.25
INFO:chainerrl.agents.trpo:Surrogate objective improve: 0.04137067496776581
INFO:chainerrl.agents.trpo:KL divergence: 0.05066625028848648
INFO:chainerrl.agents.trpo:KL divergence exceeds max_kl. Bakctracking...
INFO:chainerrl.agents.trpo:Line search iteration: 3 step size: 0.125
INFO:chainerrl.agents.trpo:Surrogate objective improve: 0.020503554493188858
INFO:chainerrl.agents.trpo:KL divergence: 0.012677105143666267
INFO:chainerrl.agents.trpo:KL divergence exceeds max_kl. Bakctracking...
INFO:chainerrl.agents.trpo:Line search iteration: 4 step size: 0.0625
INFO:chainerrl.agents.trpo:Surrogate objective improve: 0.010174408555030823
INFO:chainerrl.agents.trpo:KL divergence: 0.003170591313391924

chainer==3.1.0

INFO:chainerrl.experiments.train_agent:outdir:results/20171217T050602.057921 step:1022 episode:48 R:9.23186813481
INFO:chainerrl.experiments.train_agent:statistics:[('average_value', nan), ('average_entropy', 3.7070460319519043), ('average_kl', nan), ('average_policy_step_size', nan)]
INFO:chainerrl.agents.trpo:Line search iteration: 0 step size: 1.0
INFO:chainerrl.agents.trpo:Surrogate objective improve: 0.03503759950399399
INFO:chainerrl.agents.trpo:KL divergence: 0.005451631732285023
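
For readers following the logs: the backtracking they record is essentially the loop below. This is a rough sketch with hypothetical names (evaluate returns the surrogate improvement and KL divergence for a candidate update; full_step is the step direction from conjugate gradient); it is not the PR's exact code.

def backtracking_line_search(evaluate, full_step, max_kl, max_backtrack=10):
    """Halve the step size until the KL constraint holds and the surrogate improves."""
    step_size = 1.0
    for i in range(max_backtrack):
        improve, kl = evaluate(step_size * full_step)
        print('Line search iteration:', i, 'step size:', step_size)
        if improve > 0 and kl <= max_kl:
            return step_size
        step_size *= 0.5
    return 0.0  # give up and keep the old parameters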

muupan (Member Author) commented Dec 16, 2017

I noticed that 3.1.0 adds double-backprop support for relevant functions such as softplus and log. Unfortunately, it seems that calling chainer.grad twice on functions that don't support double-backprop doesn't raise an error.

code

import chainer
from chainer import functions as F
import numpy as np
x = chainer.Variable(np.zeros((1, 1), dtype=np.float32))
y = F.softplus(x)
g = chainer.grad([y], [x], enable_double_backprop=True)
gg = chainer.grad([g[0]], [x])
print('g', g)
print('gg', gg)

3.0.0

g [variable([[ 0.5]])]
gg [None]

3.1.0

g [variable([[ 0.5]])]
gg [variable([[ 0.25]])]

This explains the different TRPO results.

muupan (Member Author) commented Dec 16, 2017

I think it should raise an error if the computation contains functions that don't support double-backprop, but I'm not sure how to detect them.

A None returned by chainer.grad doesn't necessarily mean such functions exist, because the double-backprop of y = 2 * x is None, too.
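
For example (a minimal illustration of that case, reusing the setup from the snippet above):

import chainer
import numpy as np

x = chainer.Variable(np.zeros((1, 1), dtype=np.float32))
y = 2 * x
g = chainer.grad([y], [x], enable_double_backprop=True)
gg = chainer.grad([g[0]], [x])
print('gg', gg)  # [None], even though the second derivative is simply zero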

muupan (Member Author) commented Dec 17, 2017

I added _find_old_style_function to detect old-style functions, and it is called to check the computation before double backprop. As of Chainer v3.1.0, there are no old-style functions in TRPO's tests and example.
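
For reference, here is a rough sketch of how such a check can traverse the computational graph; this is an assumption about the approach, not necessarily the PR's exact implementation of _find_old_style_function. Old-style functions are instances of chainer.Function, while new-style ones are chainer.FunctionNode.

import chainer

def find_old_style_functions(outputs):
    """Collect old-style chainer.Function objects reachable from the given variables."""
    found = []
    for v in outputs:
        if v.creator is None:
            continue  # leaf variable
        if isinstance(v.creator, chainer.Function):
            found.append(v.creator)
        # Recurse into the inputs of the creating function node.
        found.extend(find_old_style_functions(v.creator_node.inputs))
    return found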

muupan (Member Author) commented Dec 17, 2017

TODO:

  • normalize observations (a rough sketch of one standard approach is shown after this list)
  • compare performance to openai/baselines
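
For the first item, a rough sketch of running-mean/std observation normalization, a standard approach; the class and its details here are hypothetical, not the PR's code:

import numpy as np

class RunningMeanStd(object):
    """Track a running mean and variance and use them to normalize observations."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        # Merge the batch statistics into the running statistics.
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_var = (self.var * self.count + batch_var * batch_count
                   + delta ** 2 * self.count * batch_count / total) / total
        self.mean = self.mean + delta * batch_count / total
        self.var = new_var
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)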

muupan (Member Author) commented Dec 18, 2017

python examples/gym/train_trpo_gym.py --env Hopper-v1 --steps 2000000 --eval-interval 100000 --eval-n-runs 100
(training-curve image: trpo_hopper)

python examples/gym/train_trpo_gym.py --env Walker2d-v1 --steps 2000000 --eval-interval 100000 --eval-n-runs 100
(training-curve image: trpo_walker2d)

While these are single runs with random seed 0, their performance looks better than the results in the PPO paper (http://arxiv.org/abs/1707.06347) and comparable to those in http://arxiv.org/abs/1709.06560 as well.

muupan changed the title from "[WIP] TRPO agent" to "TRPO agent" on Dec 18, 2017
toslunar (Member) left a comment

Thanks. I reviewed.



_is_double_backprop_supported = (
    StrictVersion(chainer.__version__) >= StrictVersion('3.0.0'))
toslunar (Member):

rc prereleases of chainer will fail to be parsed.
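
For example (a small illustration, not part of the PR):

from distutils.version import LooseVersion, StrictVersion

try:
    StrictVersion('4.0.0rc1')
except ValueError as e:
    print(e)  # invalid version number '4.0.0rc1'

print(LooseVersion('4.0.0rc1') >= LooseVersion('3.1.0'))  # True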

        break
    step_size *= 0.5
else:
    self.logger.info("""\
toslunar (Member):

If there is a convention that a log message should be a single line, it might be better to use plain "..." strings instead of triple-quoted ones:

    self.logger.info("\
foo bar.")

or

    self.logger.info(
        "foo"
        " bar."
    )

dataset_iter = chainer.iterators.SerialIterator(
    dataset, self.vf_batch_size)

dataset_iter.reset()
toslunar (Member):

reset() is already done in the initializer of SerialIterator.

r0 = b - A_product_func(x)
p = r0
for i in range(max_iter):
    a = xp.dot(r0.T, r0) / xp.dot(A_product_func(p).T, p)
toslunar (Member):

.T has no effect here, since r0 and A_product_func(p) are 1-dimensional vectors.
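
For context, the textbook conjugate-gradient update written with plain dot products; this is an illustrative NumPy sketch, not the PR's code:

import numpy as np

def conjugate_gradient(A_product_func, b, max_iter=10):
    """Solve Ax = b given only a function that computes v -> Av."""
    x = np.zeros_like(b)
    r = b - A_product_func(x)
    p = r.copy()
    for _ in range(max_iter):
        Ap = A_product_func(p)
        alpha = np.dot(r, r) / np.dot(Ap, p)  # plain dot; no .T needed for 1-d vectors
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = np.dot(r_new, r_new) / np.dot(r, r)
        p = r_new + beta * p
        r = r_new
    return x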

mean_wscale=0.01,
nonlinearity=F.tanh,
var_type='diagonal',
var_func=lambda x: F.exp(x) ** 2, # Parameterize log std
toslunar (Member):

F.exp(2 * x) could be faster.

@testing.parameterize(
    *testing.product({
        'n': [1, 5],
    })
toslunar (Member):

Could you add a float32 test (with a larger tolerance)?


def _hessian_vector_product(flat_grads, params, vec):
    """Compute hessian vector product efficiently by backprop."""
    grads = _chainer_grad_with_zero([F.sum(flat_grads * vec)], params)
toslunar (Member):

Assuming all the parameters are used, chainer.grad(outputs, inputs, *args, **kwargs) just works.

muupan (Member Author):

Right. We cannot assume that for a general Hessian-vector product, but it's true that TRPO doesn't work with unused parameters because of CG. I think using chainer.grad and raising an informative error when there's a None would be better. I'll fix it.
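
For reference, a minimal sketch of a Hessian-vector product computed via double backprop with chainer.grad, raising an informative error when a parameter is unused; the helper name and error handling are hypothetical, assuming chainer >= 3.1:

import chainer
import numpy as np
from chainer import functions as F

def hessian_vector_product(loss, params, vec):
    """Compute (Hessian of loss w.r.t. params) dot vec without forming the Hessian."""
    grads = chainer.grad([loss], params, enable_double_backprop=True)
    flat_grads = F.concat([F.flatten(g) for g in grads], axis=0)
    # The gradient of (grads . vec) w.r.t. params is H vec, since vec is a constant.
    grad_dot_vec = F.sum(flat_grads * vec)
    hvp = chainer.grad([grad_dot_vec], params)
    if any(g is None for g in hvp):
        raise ValueError('Some parameters do not affect the loss; '
                         'conjugate gradient cannot handle them.')
    return np.concatenate([g.data.ravel() for g in hvp])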

hessian = compute_hessian(y, params)
self.assertEqual(np.count_nonzero(hvp), 0)
self.assertEqual(np.count_nonzero(hessian), 0)
np.testing.assert_allclose(hvp, hessian.dot(vec), atol=1e-3)
toslunar (Member):

This line seems equivalent to checking shapes, because 0 × 0 = 0.

self.policy_step_size_record = collections.deque(
    maxlen=policy_step_size_stats_window)

self.xp = self.policy.xp
toslunar (Member):

If the policy and the value function are allowed to be on different devices (CPU/GPU), self.xp might be confusing.

nonlinearity=F.relu,
mean_wscale=1,
var_func=F.softplus,
var_param_init=0,
toslunar (Member):

Do you think the name var_param_init suggests the correspondence to the names var_wscale and var_bias of FCGaussianPolicy?

muupan (Member Author):

Maybe var would be more consistent with var_bias, though less informative. Do you think var is better?

muupan (Member Author):

Actually, this parameter does not represent the variance itself; it holds values that are converted to the variance via var_func, which is why I added _param. I admit it is still confusing, but I haven't come up with a better name. Any suggestions?

muupan (Member Author) commented Mar 14, 2018

I fixed the points you mentioned, except the name of var_param_init, for which I couldn't come up with a better name.

muupan (Member Author) commented Mar 15, 2018

The coverage with Chainer v2 decreased while that with v3 increased. I think this is because the TRPO tests are active only for v3.

toslunar (Member) left a comment

LGTM

toslunar merged commit 5a0bbc5 into chainer:master on Mar 15, 2018
muupan added this to the v0.4 milestone on Jul 23, 2018