Example TRPO implementation with ReLAx
This repository contains an implementation of trust region policy optimization (TRPO) with ReLAx.
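For readers new to TRPO, the core of its update can be sketched in a few lines: solve `H x = g` with conjugate gradient (where `g` is the policy gradient and `H` is the Hessian of the KL divergence), then scale the direction so the quadratic KL estimate equals the trust-region size. This is a minimal, dependency-free illustration on a toy 2x2 problem, not ReLAx's actual code. The names `conjugate_gradient`, `trpo_step`, and `max_kl` are hypothetical, and a full TRPO update would normally follow this with a backtracking line search.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def mat_vec(H, v):
    """Dense matrix-vector product; stands in for a Hessian-vector product."""
    return [dot(row, v) for row in H]


def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - H x, with x = 0 initially
    p = list(g)  # current search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:
            break
        Hp = hvp(p)
        alpha = rs_old / dot(p, Hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def trpo_step(hvp, g, max_kl):
    """Largest step along H^-1 g whose quadratic KL estimate equals max_kl."""
    x = conjugate_gradient(hvp, g)
    xHx = dot(x, hvp(x))
    scale = math.sqrt(2.0 * max_kl / xHx)
    return [scale * xi for xi in x]


# Toy problem (assumed values): H is symmetric positive definite.
H = [[2.0, 0.5], [0.5, 1.0]]
g = [1.0, -1.0]
max_kl = 0.01

direction = conjugate_gradient(lambda v: mat_vec(H, v), g)
step = trpo_step(lambda v: mat_vec(H, v), g, max_kl)

# Quadratic approximation of the KL divergence for the proposed step.
quad_kl = 0.5 * dot(step, mat_vec(H, step))
```

In a real policy network `H` is never formed explicitly. The `hvp` callable would instead compute Hessian-vector products by automatic differentiation, which is why the solver only takes a function.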
The TRPO actor was trained on the HalfCheetah-v2 MuJoCo Gym environment for 4M environment steps.
The graph of average return vs. training step is shown below (batch_size=40000):
Resulting Policy: