
Possible issue with gradient calculation in actor_critic_cartpole example #194


Description

@refraction-ray

In this example, https://github.com/keras-team/keras-io/blob/master/examples/rl/actor_critic_cartpole.py, the actor gradient is taken as the gradient of the loss $L = \sum \ln\pi \,(\text{reward} - \text{value})$.

However, since value also depends directly on model.variables, the gradient of this loss is not the textbook policy gradient $\nabla L = \sum \nabla(\ln\pi) \,(\text{reward} - \text{value})$. For a detailed derivation of the correct gradient formula in this case, see https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/.
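To make the discrepancy explicit (a sketch of the product rule; the notation $R_t$ for the return and $V_\theta$ for the critic value is mine, not the example's): differentiating $L$ directly when the critic shares parameters with the policy picks up an extra term,

$$
\nabla_\theta L = \sum_t \Big[\, (R_t - V_\theta(s_t))\, \nabla_\theta \ln\pi_\theta(a_t \mid s_t) \;-\; \ln\pi_\theta(a_t \mid s_t)\, \nabla_\theta V_\theta(s_t) \,\Big],
$$

whereas the textbook policy gradient with a baseline keeps only the first term, treating $V(s_t)$ as a constant:

$$
\nabla_\theta L = \sum_t (R_t - V(s_t))\, \nabla_\theta \ln\pi_\theta(a_t \mid s_t).
$$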

Therefore, in the implementation one should use diff = ret - tf.stop_gradient(value) (L147) instead of the plain diff = ret - value, so that the actor loss does not backpropagate gradients through value. In experiments, this change greatly reduces the number of training steps needed with the same hyperparameters and setup.
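A minimal sketch of the proposed change, with stand-in tensors mirroring what the example accumulates per episode (the variable names and standalone setup below are illustrative, not the exact code at L147):

```python
import tensorflow as tf

# Illustrative stand-ins for the per-episode quantities in the example:
# discounted returns, log-probabilities of taken actions, and critic values.
# In the real example these come from the Keras model; Variables are used here
# only so the sketch is runnable and gradients have something to flow into.
returns = tf.constant([1.0, 0.5, 0.2])
log_probs = tf.Variable([-0.7, -1.2, -0.9])   # log pi(a_t | s_t)
values = tf.Variable([0.8, 0.4, 0.3])         # V(s_t) from the critic head

huber_loss = tf.keras.losses.Huber()

with tf.GradientTape() as tape:
    # Proposed fix: stop the gradient through the critic value when forming
    # the advantage, so the actor loss only backpropagates through log pi.
    diff = returns - tf.stop_gradient(values)
    actor_loss = -tf.reduce_sum(log_probs * diff)

    # The critic is still trained through its own regression loss on the
    # returns, as in the original example.
    critic_loss = huber_loss(returns, values)

    loss = actor_loss + critic_loss

grads = tape.gradient(loss, [log_probs, values])
```

With this change, the actor loss contributes no gradient to values; the critic head is updated only through critic_loss, matching the textbook formula above.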

The similar actor-critic example from TensorFlow appears to have the same issue: https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic.
