Description
Hey, this code and the AWAC paper are awesome! Thanks for sharing this library; I've been reading through it lately, trying to understand and apply the AWAC paper. :)
However, I have a question about how the Q(s,a) term of the advantage function is implemented in the library:
rlkit/rlkit/torch/sac/awac_trainer.py
Line 554 in c81509d
Here, q1_pred and q2_pred are both computed directly from the learned Q1 and Q2 functions.
I was wondering about this because, as I understand it, the code uses Q(s,a) directly instead of the 1-step return r(s,a) + γ·Q(s',a') to compute the Q(s,a) term of the advantage function. It seems to me that a 1-step return estimate is already computed under the variable q_target:
rlkit/rlkit/torch/sac/awac_trainer.py
Line 496 in c81509d
Is there a reason for using Q(s,a) obtained directly as min(Q1(s,a), Q2(s,a)) instead of the 1-step return r(s,a) + γ·min(Q1(s',a'), Q2(s',a')) (see the sketch below)? Couldn't the former introduce more bias into the return estimate?
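To make the comparison concrete, here is a minimal sketch of the two variants I have in mind. This is not the library's actual code: the function signature, argument names, and the way next actions are sampled from the policy are all my own assumptions, loosely following rlkit conventions.

```python
import torch

def awac_q_terms(qf1, qf2, target_qf1, target_qf2, policy,
                 obs, actions, rewards, next_obs, terminals, discount=0.99):
    """Two candidate Q(s, a) terms for the AWAC advantage (illustrative sketch only)."""
    # Option A: query the learned critics at (s, a) directly,
    # which is what the trainer appears to do with q1_pred / q2_pred.
    q_direct = torch.min(qf1(obs, actions), qf2(obs, actions))

    # Option B: 1-step bootstrapped return, analogous to the Bellman target
    # (q_target) already computed for the critic loss.
    with torch.no_grad():
        next_actions = policy(next_obs)  # however next actions are sampled in practice
        q_next = torch.min(target_qf1(next_obs, next_actions),
                           target_qf2(next_obs, next_actions))
        q_one_step = rewards + (1.0 - terminals) * discount * q_next

    # Either term would then feed the advantage weights: adv = q_* - v(s).
    return q_direct, q_one_step
```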
Also, I ran some tests switching to the 1-step-return version in my own implementation on a different problem, and the results were very similar, so it's unclear to me whether there's actually any benefit in switching from one formulation to the other.
Thanks in advance!