Comparing against the original paper, Learning Off-Policy with Online Planning (LOOP), I found some implementation differences.
-
Apologies for the delayed response, and thank you for your suggestions.
To unify the different planning algorithms, we simplified the code and converted the original numpy implementation to a torch implementation for better GPU efficiency. This approach is inspired by the implementation in TDMPC.
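As an illustration of that kind of conversion, a numpy planning loop that scores candidate action sequences one at a time can be replaced by a single batched torch update, so all samples are evaluated in one tensor operation (and can run on GPU). The sketch below shows a minimal CEM-style refit step; the function names, sizes, and toy reward are illustrative assumptions, not the repository's actual API:

```python
import torch

torch.manual_seed(0)

def cem_step(mean, std, reward_fn, num_samples=64, num_elites=8):
    # mean, std: (horizon, action_dim) parameters of the sampling distribution.
    # Sample all candidate action sequences at once: (num_samples, horizon, action_dim).
    samples = mean + std * torch.randn(num_samples, *mean.shape)
    # Score every candidate in a single batched call instead of a python loop.
    rewards = reward_fn(samples)  # (num_samples,)
    # Keep the top-scoring elites and refit the distribution to them.
    elites = samples[rewards.topk(num_elites).indices]
    return elites.mean(dim=0), elites.std(dim=0)

horizon, action_dim = 5, 2
mean = torch.zeros(horizon, action_dim)
std = torch.ones(horizon, action_dim)
# Toy reward (assumption for the sketch): prefer small-magnitude actions.
reward_fn = lambda a: -(a ** 2).sum(dim=(1, 2))
new_mean, new_std = cem_step(mean, std, reward_fn)
print(new_mean.shape)  # torch.Size([5, 2])
```

The same structure applies to the other planners: each one differs only in how the sampling distribution is proposed and refit, which is what makes a unified torch implementation practical.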
The action sequences generated by the actors can be found here: ARC Actors Action Sequence
The action sequences generated by the historical distributions are available here: ARC Historical Distributions Action Sequence