Description
Hey, this code and the AWAC paper are awesome! Thanks for sharing this library; I've been reading through it lately, trying to understand and apply the AWAC paper. :)
However, I have a question about how the Q(s,a) term of the advantage function is implemented in the library:
rlkit/rlkit/torch/sac/awac_trainer.py
Line 554 in c81509d
Here, q1_pred and q2_pred are both computed directly from the learned Q1 and Q2 functions.
I was wondering about this because, as I understand it, the code uses Q(s,a) directly instead of the 1-step return r(s,a) + γ·Q(s',a') to compute the Q(s,a) term of the advantage function. It seems to me that a 1-step return estimate is already computed under the variable q_target:
rlkit/rlkit/torch/sac/awac_trainer.py
Line 496 in c81509d
Is there a reason for using Q(s,a) obtained directly as min(Q1(s,a), Q2(s,a)) instead of the 1-step return r(s,a) + γ·min(Q1(s',a'), Q2(s',a')) (see the sketch below)? Couldn't the former introduce more bias into the return estimate?
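To make the comparison concrete, here is a minimal sketch of the two variants I have in mind. This is not the library's actual code: the function signature, argument names, and the way next actions are sampled from the policy are all my own assumptions, loosely following rlkit conventions.

```python
import torch

def awac_q_terms(qf1, qf2, target_qf1, target_qf2, policy,
                 obs, actions, rewards, next_obs, terminals, discount=0.99):
    """Two candidate Q(s, a) terms for the AWAC advantage (illustrative sketch only)."""
    # Option A: query the learned critics at (s, a) directly,
    # which is what the trainer appears to do with q1_pred / q2_pred.
    q_direct = torch.min(qf1(obs, actions), qf2(obs, actions))

    # Option B: 1-step bootstrapped return, analogous to the Bellman target
    # (q_target) already computed for the critic loss.
    with torch.no_grad():
        next_actions = policy(next_obs)  # however next actions are sampled in practice
        q_next = torch.min(target_qf1(next_obs, next_actions),
                           target_qf2(next_obs, next_actions))
        q_one_step = rewards + (1.0 - terminals) * discount * q_next

    # Either term would then feed the advantage weights: adv = q_* - v(s).
    return q_direct, q_one_step
```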
Also, I ran some tests switching to the 1-step-return version in my own implementation on a different problem, and the results were very similar, so it's unclear to me whether there's actually any benefit in switching from one formulation to the other.
Thanks in advance!