
Possible issue with gradient calculation in actor_critic_cartpole example #194


Description

@refraction-ray

In this example, https://github.com/keras-team/keras-io/blob/master/examples/rl/actor_critic_cartpole.py, the actor gradient is taken as the gradient of the loss $L = \sum \ln\pi \,(\text{reward} - \text{value})$.

However, since value also depends directly on model.variables, the gradient of this loss is not the textbook policy gradient $\nabla L = \sum \nabla(\ln\pi) \,(\text{reward} - \text{value})$. For a detailed derivation of the correct gradient formula in this case, see https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/.
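To make the discrepancy explicit (a sketch of the product rule; the notation $R_t$ for the return and $V_\theta$ for the critic value is mine, not the example's): differentiating $L$ directly when the critic shares parameters with the policy picks up an extra term,

$$
\nabla_\theta L = \sum_t \Big[\, (R_t - V_\theta(s_t))\, \nabla_\theta \ln\pi_\theta(a_t \mid s_t) \;-\; \ln\pi_\theta(a_t \mid s_t)\, \nabla_\theta V_\theta(s_t) \,\Big],
$$

whereas the textbook policy gradient with a baseline keeps only the first term, treating $V(s_t)$ as a constant:

$$
\nabla_\theta L = \sum_t (R_t - V(s_t))\, \nabla_\theta \ln\pi_\theta(a_t \mid s_t).
$$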

Therefore, in the implementation one should use diff = ret - tf.stop_gradient(value) (L147) instead of the plain diff = ret - value, so that the actor loss does not backpropagate gradients through value. In experiments, this change greatly reduces the number of training steps needed with the same hyperparameters and setup.
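A minimal sketch of the proposed change, with stand-in tensors mirroring what the example accumulates per episode (the variable names and standalone setup below are illustrative, not the exact code at L147):

```python
import tensorflow as tf

# Illustrative stand-ins for the per-episode quantities in the example:
# discounted returns, log-probabilities of taken actions, and critic values.
# In the real example these come from the Keras model; Variables are used here
# only so the sketch is runnable and gradients have something to flow into.
returns = tf.constant([1.0, 0.5, 0.2])
log_probs = tf.Variable([-0.7, -1.2, -0.9])   # log pi(a_t | s_t)
values = tf.Variable([0.8, 0.4, 0.3])         # V(s_t) from the critic head

huber_loss = tf.keras.losses.Huber()

with tf.GradientTape() as tape:
    # Proposed fix: stop the gradient through the critic value when forming
    # the advantage, so the actor loss only backpropagates through log pi.
    diff = returns - tf.stop_gradient(values)
    actor_loss = -tf.reduce_sum(log_probs * diff)

    # The critic is still trained through its own regression loss on the
    # returns, as in the original example.
    critic_loss = huber_loss(returns, values)

    loss = actor_loss + critic_loss

grads = tape.gradient(loss, [log_probs, values])
```

With this change, the actor loss contributes no gradient to values; the critic head is updated only through critic_loss, matching the textbook formula above.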

The similar actor-critic example from TensorFlow appears to have the same issue: https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic.
