This is an attempt to implement Generative Adversarial Imitation Learning (GAIL) for deterministic policies with off-policy learning on static data. The policy never interacts with the environment (except for evaluation); instead, it is trained on state-action pairs in which the policy selects actions for states sampled from the expert data.
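To make the setup concrete, here is a minimal self-contained numpy sketch of the idea under toy assumptions (1-D states, a hypothetical linear expert `a = 2*s`, a linear deterministic policy, and a logistic discriminator over hand-picked quadratic features). It is illustrative only, not the code in this repo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Static expert dataset for a toy 1-D task (assumed): expert action a = 2*s.
S_exp = rng.uniform(-1.0, 1.0, size=(512, 1))
A_exp = 2.0 * S_exp

def features(s, a):
    # Quadratic features so a linear logistic discriminator can separate the pairs.
    return np.concatenate([s, a, s * s, a * a, s * a, np.ones_like(s)], axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w_pi = 0.0            # deterministic linear policy: pi(s) = w_pi * s
w_d = np.zeros(6)     # discriminator weights over features(s, a)

for step in range(1000):
    # The policy never acts in the environment: states come from the expert data,
    # and the policy only supplies actions for those states.
    idx = rng.integers(0, len(S_exp), size=128)
    s, a_e = S_exp[idx], A_exp[idx]
    a_p = w_pi * s

    # Discriminator step: binary cross-entropy, expert pairs -> 1, policy pairs -> 0.
    z_e = features(s, a_e) @ w_d
    z_p = features(s, a_p) @ w_d
    grad_d = (features(s, a_e).T @ (sigmoid(z_e) - 1.0)
              + features(s, a_p).T @ sigmoid(z_p)) / len(s)
    w_d -= 0.2 * grad_d

    # Policy step: minimize -log D(s, pi(s)), i.e. fool the updated discriminator.
    # The gradient flows through the discriminator's action input into the policy.
    z_p = features(s, w_pi * s) @ w_d
    # d(logit)/d(a) for the feature map above: w_a + 2*w_{a^2}*a + w_{s*a}*s
    dz_da = w_d[1] + 2.0 * w_d[3] * (w_pi * s[:, 0]) + w_d[4] * s[:, 0]
    grad_pi = float(np.mean((sigmoid(z_p) - 1.0) * dz_da * s[:, 0]))
    w_pi -= 0.05 * grad_pi

# w_pi should drift toward the expert slope (2.0), though, as noted below,
# adversarial training of this kind is high-variance and can oscillate.
print(w_pi)
```

The two updates alternate: the discriminator is fit to tell expert pairs from policy pairs, then the deterministic policy is adjusted by backpropagating through the discriminator's action input, which is what makes the off-policy, interaction-free setup possible in the first place.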
Although it sometimes works (depending on the environment), the algorithm has high variance and the results are inconsistent.
*Side-by-side comparison: expert policy vs. recovered policy (trained on 10 expert episodes).*
*Plot: epochs vs. rewards.*