Solving an OpenAI Gym environment with the REINFORCE update. The environment is considered solved if the average score over 100 consecutive episodes is at least 195.
REINFORCE.py trains an agent with the REINFORCE algorithm, then prints the scores and the average score over the past 100 episodes.
Consider the "CartPole-v0" environment. We try to solve it with the REINFORCE algorithm, a policy-gradient method. The algorithm uses a stochastic gradient ascent update, so convergence to a local optimum is assured for a decreasing step size alpha. As a Monte Carlo method, REINFORCE learns slowly and can suffer from high variance.
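The Monte Carlo update above can be sketched as follows. This is a minimal NumPy illustration using a linear softmax policy for clarity; the repository's REINFORCE.py parametrizes the policy with a neural network instead, and the function and variable names here are illustrative only.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute the Monte Carlo return G_t = sum_k gamma^k * r_{t+k} for each t."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update: theta += alpha * gamma^t * G_t * grad log pi(a_t|s_t).

    theta:   (n_actions, n_features) weights of a linear softmax policy.
    episode: list of (state, action, reward) tuples from one rollout.
    """
    states, actions, rewards = zip(*episode)
    G = discounted_returns(rewards, gamma)
    for t, (s, a) in enumerate(zip(states, actions)):
        probs = softmax(theta @ s)
        # grad log pi(a|s) for a linear softmax policy: e_a ⊗ s - probs ⊗ s
        grad = -np.outer(probs, s)
        grad[a] += s
        theta += alpha * (gamma ** t) * G[t] * grad
    return theta
```

Because the whole-episode return G_t is used in place of a bootstrapped estimate, the gradient is unbiased but high-variance, which is the source of the slow learning noted above.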
We use a neural network to parametrize the policy. Convergence is not always achieved, and oscillations may appear. The algorithm typically solves the environment in 1000-5000 episodes.
With two fully connected hidden layers, learning generally fails (see below).
Generally speaking, as stated in Sutton and Barto (2018), the backpropagation algorithm can produce good results for shallow networks, but it may not work well for deeper networks. A shallow network with one hidden layer (4 units) is therefore used to solve this environment.
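The shallow policy network described above can be sketched like this. It is a NumPy illustration only, assuming CartPole-v0's 4-dimensional observation and 2 actions; the tanh activation and the weight initialization scale are assumptions, not necessarily what REINFORCE.py uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes matching CartPole-v0: 4 observations, one hidden layer of 4 units, 2 actions.
n_obs, n_hidden, n_actions = 4, 4, 2

# One fully connected hidden layer, as described above (small random init is an assumption).
W1 = rng.normal(scale=0.1, size=(n_hidden, n_obs))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_actions, n_hidden))
b2 = np.zeros(n_actions)

def policy(obs):
    """Map an observation to action probabilities via one hidden layer and a softmax."""
    h = np.tanh(W1 @ obs + b1)   # hidden activation (tanh chosen for illustration)
    z = W2 @ h + b2
    e = np.exp(z - z.max())      # stable softmax over the two actions
    return e / e.sum()
```

During training, an action is sampled from these probabilities and the log-probability of the chosen action feeds the REINFORCE gradient.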
Below is the result obtained with the PPO algorithm using a clipped surrogate objective.
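For reference, the clipped surrogate objective mentioned above can be written as a short function. This is a generic sketch of the standard PPO-clip formula, not code from this repository; the clipping parameter eps=0.2 is a commonly used default, assumed here.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: mean of min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio:     pi_new(a|s) / pi_old(a|s) per sample.
    advantage: advantage estimate A per sample.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the min removes the incentive to move the ratio outside [1-eps, 1+eps].
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))
```

Compared to the plain REINFORCE objective, the clipping limits how far a single update can push the policy, which is what makes PPO noticeably more stable on this environment.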