Hi,
I am looking at the PPO implementation, and I am curious about this part (many other implementations follow the same workflow, so I am also curious whether I am missing anything).

The action_log_probs is created, its gradient is removed (by setting requires_grad=False), and it is inserted into the storage buffer. This action_log_probs is produced by act() during rollout collection and is later referred to as old_action_log_probs_batch in the PPO update, where the ratio is computed against the action_log_probs returned by evaluate_actions().

If I am not mistaken, evaluate_actions() and act() will output the same action_log_probs, because they use the same actor_critic and call log_probs(action); the only difference is that old_action_log_probs_batch has its gradient removed, so backpropagation will not go through it.

So my question is: why do we bother to save old_action_log_probs_batch in the storage when something like this could be created on the fly?

Thank you for your attention. Looking forward to the discussion.

Regards,
Tian
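For concreteness, here is a minimal, hypothetical sketch of the workflow being described. TinyActor, its act()/evaluate_actions() methods, and all shapes are invented for illustration and are not the repo's actual classes; the point is only that the two calls return the same log probabilities as long as the parameters have not changed, with the stored copy detached from the graph.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class TinyActor(nn.Module):
    """Toy stand-in for actor_critic; only the pieces needed for the point."""
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.logits = nn.Linear(obs_dim, n_actions)

    def act(self, obs):
        dist = Categorical(logits=self.logits(obs))
        action = dist.sample()
        return action, dist.log_prob(action)

    def evaluate_actions(self, obs, action):
        dist = Categorical(logits=self.logits(obs))
        return dist.log_prob(action), dist.entropy()

actor = TinyActor()
obs = torch.randn(8, 4)

# Rollout phase: log probs go into the storage with the gradient removed.
with torch.no_grad():
    action, old_log_probs = actor.act(obs)

# Update phase: same parameters, same (obs, action) batch -> same values,
# except that these log probs carry gradients.
new_log_probs, _ = actor.evaluate_actions(obs, action)
ratio = torch.exp(new_log_probs - old_log_probs)
print(torch.allclose(ratio, torch.ones_like(ratio)))  # True before any update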
In my understanding, the key point is that after sampling trajectories, the agent's parameters are updated several times (up to args.ppo_epoch times). For the first update, the situation is as you said. From the second update onward, however, old_action_log_probs in the PPO implementation is still computed from the original parameters, while the on-the-fly version in your proposal would be computed from parameters that have already been updated once, so the ratio would no longer compare the current policy against the one that actually generated the trajectories.
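To make this concrete, here is a hypothetical, self-contained sketch of the epoch loop under those assumptions (the policy, shapes, and hyperparameters are invented for illustration and are not the repo's code). The stored old_log_probs stays fixed across all ppo_epoch passes, while the new log probs are recomputed from the updated parameters each pass.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

policy = nn.Linear(4, 2)          # toy stand-in for actor_critic
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
obs = torch.randn(8, 4)
advantages = torch.randn(8)
clip_param, ppo_epoch = 0.2, 4

# Computed once, under the parameters that generated the trajectories;
# this plays the role of old_action_log_probs_batch in the storage.
with torch.no_grad():
    dist = Categorical(logits=policy(obs))
    actions = dist.sample()
    old_log_probs = dist.log_prob(actions)

for epoch in range(ppo_epoch):
    new_log_probs = Categorical(logits=policy(obs)).log_prob(actions)
    # With the stored old_log_probs, the ratio drifts away from 1 as the
    # parameters change across epochs. If "old" were instead recomputed
    # here as new_log_probs.detach(), the ratio would always be exactly 1
    # and the clipping term would never take effect.
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    loss = -torch.min(surr1, surr2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In other words, saving old_action_log_probs_batch is what lets the clipped objective measure how far the updated policy has moved from the behaviour policy that collected the data.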