TRPO agent #204
Conversation
Line search always backtracks several times. I need to check whether this is normal behavior or something is wrong in the current implementation.
I noticed 3.1.0 adds double-backprop support for relevant functions such as softplus and log, which 3.0.0 lacks. This explains the different TRPO results between 3.0.0 and 3.1.0.
I think it should raise an error if the computation contains functions that don't support double-backprop, but I am not sure how to detect them.
I added a TODO about double-backprop.
If update_interval=50 and the length of training is 50, TRPO didn't update the model at all before this change.
python examples/gym/train_trpo_gym.py --env Hopper-v1 --steps 2000000 --eval-interval 100000 --eval-n-runs 100
python examples/gym/train_trpo_gym.py --env Walker2d-v1 --steps 2000000 --eval-interval 100000 --eval-n-runs 100
While these are single runs from random seed 0, their performance looks better than that reported in the PPO paper (http://arxiv.org/abs/1707.06347) and comparable to http://arxiv.org/abs/1709.06560 as well.
Thanks. I reviewed.
chainerrl/agents/trpo.py (Outdated)
_is_double_backprop_supported = (
    StrictVersion(chainer.__version__) >= StrictVersion('3.0.0'))
rc prereleases of Chainer will fail to be parsed by StrictVersion.
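A minimal sketch of one way around that, assuming the check only needs to gate on the Chainer version: LooseVersion parses prerelease strings that StrictVersion rejects (packaging.version.parse would order prereleases even more precisely).

```python
# Sketch only: LooseVersion tolerates prerelease version strings such as
# release candidates, which StrictVersion raises ValueError on.
from distutils.version import LooseVersion

import chainer

_is_double_backprop_supported = (
    LooseVersion(chainer.__version__) >= LooseVersion('3.0.0'))
```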
chainerrl/agents/trpo.py (Outdated)
        break
    step_size *= 0.5
else:
    self.logger.info("""\
If there is a convention that a log message is a single line, it might be better to use plain " strings instead of triple-quoted ones:
self.logger.info("\
foo bar.")
or
self.logger.info(
    "foo"
    " bar."
)
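For illustration, a hedged sketch of the backtracking loop logging a single-line message in the concatenated-string style suggested above; accept, apply_step, and max_backtrack are hypothetical names, not the PR's actual API.

```python
import logging

logger = logging.getLogger(__name__)


def backtracking_line_search(full_step, accept, apply_step, max_backtrack=10):
    """Halve the step size until accept() is satisfied (hypothetical callbacks)."""
    step_size = 1.0
    for _ in range(max_backtrack):
        if accept(step_size, full_step):
            apply_step(step_size, full_step)
            break
        step_size *= 0.5
    else:
        # A single-line message is easier to grep than a triple-quoted block.
        logger.info(
            "Line search could not find a step size satisfying the"
            " constraints; the policy parameters were left unchanged.")
```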
chainerrl/agents/trpo.py (Outdated)
dataset_iter = chainer.iterators.SerialIterator(
    dataset, self.vf_batch_size)

dataset_iter.reset()
reset() is already done in the initializer of SerialIterator, so this call is redundant.
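A quick illustration with a placeholder dataset: the iterator is usable immediately after construction, with no explicit reset().

```python
import chainer

dataset = list(range(100))  # placeholder dataset
it = chainer.iterators.SerialIterator(dataset, batch_size=32)
while it.epoch < 1:
    batch = it.next()  # no it.reset() needed; batch would feed the value-function update
```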
chainerrl/misc/conjugate_gradient.py (Outdated)
r0 = b - A_product_func(x)
p = r0
for i in range(max_iter):
    a = xp.dot(r0.T, r0) / xp.dot(A_product_func(p).T, p)
.T has no effect since r0 and A_product_func(p) are 1-dim vectors.
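For reference, a hedged sketch of conjugate gradient over plain 1-D NumPy vectors, where ordinary dot products suffice and .T would be a no-op; this is illustrative, not the PR's exact implementation.

```python
import numpy as np


def conjugate_gradient(A_product_func, b, max_iter=10, tol=1e-10):
    """Solve A x = b given only a function computing A-vector products."""
    x = np.zeros_like(b)
    r = b - A_product_func(x)
    p = r.copy()
    rs_old = np.dot(r, r)
    for _ in range(max_iter):
        Ap = A_product_func(p)
        alpha = rs_old / np.dot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = np.dot(r, r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```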
examples/gym/train_trpo_gym.py (Outdated)
mean_wscale=0.01,
nonlinearity=F.tanh,
var_type='diagonal',
var_func=lambda x: F.exp(x) ** 2,  # Parameterize log std
F.exp(2 * x) could be faster.
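A side-by-side of the two parameterizations, which are mathematically identical since exp(x)² = exp(2x):

```python
import chainer.functions as F

var_func_original = lambda x: F.exp(x) ** 2   # exponentiate, then square
var_func_suggested = lambda x: F.exp(2 * x)   # same variance, one elementwise op fewer
```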
@testing.parameterize(
    *testing.product({
        'n': [1, 5],
    })
Could you add a float32 test (with a larger tol)?
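A hedged sketch of how the parameterization could also cover dtypes with a looser tolerance for float32; the class name, test body, and the imported solver path/signature are assumptions, not the PR's actual test.

```python
import unittest

import numpy as np
from chainer import testing

from chainerrl.misc.conjugate_gradient import conjugate_gradient  # assumed import path


@testing.parameterize(*testing.product({
    'n': [1, 5],
    'dtype': [np.float32, np.float64],
}))
class TestConjugateGradientDtype(unittest.TestCase):

    def test_solves_spd_system(self):
        tol = 1e-3 if self.dtype == np.float32 else 1e-6
        A = np.random.rand(self.n, self.n).astype(self.dtype)
        A = A.dot(A.T) + self.n * np.eye(self.n, dtype=self.dtype)  # make it SPD
        b = np.random.rand(self.n).astype(self.dtype)
        x = conjugate_gradient(lambda v: A.dot(v), b)  # assumed signature
        np.testing.assert_allclose(A.dot(x), b, atol=tol, rtol=tol)
```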
chainerrl/agents/trpo.py (Outdated)
def _hessian_vector_product(flat_grads, params, vec):
    """Compute hessian vector product efficiently by backprop."""
    grads = _chainer_grad_with_zero([F.sum(flat_grads * vec)], params)
Assuming all the parameters are used, chainer.grad(outputs, inputs, *args, **kwargs) just works.
Right. We cannot assume that for a general Hessian-vector product, but it's true that TRPO doesn't work with unused parameters because of CG. I think using chainer.grad and raising an informative error when there's a None would be better. I'll fix it.
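A hedged sketch of the fix described here: call chainer.grad directly and raise an informative error when a parameter receives no gradient; _flatten_and_concat_variables is a hypothetical helper, not the PR's code.

```python
import chainer
import chainer.functions as F


def _flatten_and_concat_variables(vs):
    # Hypothetical helper: flatten each gradient and join them into one vector.
    return F.concat([F.flatten(v) for v in vs], axis=0)


def _hessian_vector_product(flat_grads, params, vec):
    """Compute a Hessian-vector product by double backprop."""
    grads = chainer.grad([F.sum(flat_grads * vec)], params)
    if any(g is None for g in grads):
        raise RuntimeError(
            'Some parameters did not receive gradients. TRPO requires every'
            ' parameter to affect the policy objective.')
    return _flatten_and_concat_variables(grads).data
```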
tests/agents_tests/test_trpo.py (Outdated)
hessian = compute_hessian(y, params)
self.assertEqual(np.count_nonzero(hvp), 0)
self.assertEqual(np.count_nonzero(hessian), 0)
np.testing.assert_allclose(hvp, hessian.dot(vec), atol=1e-3)
This line seems equivalent to just checking shapes, because both sides are all zeros (0 × 0 = 0).
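A hedged sketch of a complementary, non-vacuous check using an objective with a nonzero Hessian; compute_hessian_vector_product and compute_hessian stand in for the helpers used by the test above (assumed names and signatures).

```python
import chainer
import chainer.functions as F
import numpy as np


def test_nonzero_hessian_vector_product():
    param = chainer.Parameter(np.random.rand(3).astype(np.float64))
    vec = np.random.rand(3)
    y = F.sum(param ** 2)  # Hessian is 2 * I, so the comparison is non-trivial
    hvp = compute_hessian_vector_product(y, [param], vec)  # assumed helper
    hessian = compute_hessian(y, [param])                  # assumed helper
    assert np.count_nonzero(hessian) > 0
    np.testing.assert_allclose(hvp, hessian.dot(vec), atol=1e-3)
```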
chainerrl/agents/trpo.py (Outdated)
self.policy_step_size_record = collections.deque(
    maxlen=policy_step_size_stats_window)

self.xp = self.policy.xp
If it's allowed to put policy and vf on different devices (CPU/GPU), self.xp might be confusing.
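A small sketch of the alternative implied here, with assumed attribute names: query each model for its own array module instead of caching a single self.xp.

```python
def policy_xp(agent):
    return agent.policy.xp  # numpy or cupy, depending on the policy's device


def vf_xp(agent):
    return agent.vf.xp  # may differ if the value function sits on another device
```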
nonlinearity=F.relu,
mean_wscale=1,
var_func=F.softplus,
var_param_init=0,
Do you think the name var_param_init suggests the correspondence to the names var_wscale and var_bias of FCGaussianPolicy?
Maybe var is more consistent with var_bias, while less informative. Do you think var is better?
Actually, this param does not represent the variance itself; it represents values that are converted to the variance via var_func, so I added _param. I admit it is still confusing, but I didn't come up with a better name. Any suggestions?
I fixed the points you mentioned, except the name of var_param_init.
The coverage with Chainer v2 decreased while that of v3 increased. I think this is because the TRPO tests are active only for v3.
LGTM
This PR adds chainerrl.agents.TRPO, which implements the TRPO-GAE algorithm. Resolves #202
Strangely, chainer==3.0.0 and chainer==3.1.0 give different results. With chainer==3.0.0 line search always backtracks several times, while with chainer==3.1.0 it rarely backtracks; the values of the expected improvement and the KL divergence differ. This problem is solved: the current TRPO only works with 3.1.0 or later.
python examples/gym/train_trpo_gym.py --env Hopper-v1 --gpu -1
chainer==3.0.0
chainer==3.1.0