Description
In this example https://github.com/keras-team/keras-io/blob/master/examples/rl/actor_critic_cartpole.py, the gradient applied to the actor is the gradient of the loss `-log_prob * diff`, where `diff = ret - value`.

However, since `value` is the critic's output and therefore also depends directly on `model.variables`, the gradient of this loss is not the textbook policy gradient, in which the advantage `G_t - V(s_t)` multiplies `grad log pi(a_t | s_t)` as a constant.

Therefore, the implementation should use `diff = ret - tf.stop_gradient(value)` (L147) instead of the plain `diff = ret - value`, so that no gradient is backpropagated through `value`. In experiments, this change greatly reduces the number of training steps needed with the same hyperparameters and setup.
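
For reference, a minimal sketch of the per-episode loss computation with the proposed change applied (the episode loop, model, and `GradientTape` from the example are omitted; the per-step lists below are placeholder values just to make the snippet runnable):

```python
import tensorflow as tf

huber_loss = tf.keras.losses.Huber()

# Placeholder per-step quantities standing in for the lists the example
# accumulates over one episode (hypothetical values for illustration).
action_log_probs = [tf.constant(-0.7), tf.constant(-1.2)]
critic_values = [tf.constant(0.5), tf.constant(0.9)]
returns = [tf.constant(1.0), tf.constant(0.4)]

actor_losses = []
critic_losses = []
for log_prob, value, ret in zip(action_log_probs, critic_values, returns):
    # Proposed change: stop the gradient through the critic's value so the
    # advantage is treated as a constant in the actor loss, as in the
    # textbook policy-gradient update.
    diff = ret - tf.stop_gradient(value)
    actor_losses.append(-log_prob * diff)
    # The critic loss still uses the differentiable value, so the critic
    # is trained exactly as before.
    critic_losses.append(
        huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
    )

loss_value = sum(actor_losses) + sum(critic_losses)
```

Note that only the actor's advantage is affected; the critic's regression target and gradient path are unchanged.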
The similar example from TensorFlow appears to have the same kind of issue: https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic.