
# Examples

## Monte Carlo control and Maze2D

We used the classical Monte Carlo method to learn state-action values for a 2D maze environment. Learning was on-policy: we followed an epsilon-greedy policy based on the current state-action values, with epsilon decaying exponentially as learning progressed to reduce exploration.
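The loop described above can be sketched as follows. This is a minimal, self-contained illustration, not the repository's implementation: the tiny 1D corridor stands in for Maze2D, and all names and parameter values (`eps0`, `decay`, episode count) are illustrative assumptions.

```python
import random
from collections import defaultdict

# Hypothetical 1D corridor as a stand-in for Maze2D: states 0..4, goal at 4.
# Actions: 0 = left, 1 = right. Reward is -1 per step, 0 on reaching the goal.
N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (0.0 if s2 == GOAL else -1.0), s2 == GOAL

def epsilon_greedy(Q, s, eps):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def mc_control(episodes=500, eps0=1.0, decay=0.99, gamma=1.0):
    Q = defaultdict(float)
    counts = defaultdict(int)
    eps = eps0
    for _ in range(episodes):
        # Generate one episode with the current epsilon-greedy policy.
        s, done, trajectory = 0, False, []
        while not done and len(trajectory) < 100:
            a = epsilon_greedy(Q, s, eps)
            s2, r, done = step(s, a)
            trajectory.append((s, a, r))
            s = s2
        # First-visit Monte Carlo update: average the observed returns.
        G, returns = 0.0, []
        for (s_, a_, r_) in reversed(trajectory):
            G = gamma * G + r_
            returns.append((s_, a_, G))
        seen = set()
        for (s_, a_, G) in reversed(returns):  # forward order for first-visit
            if (s_, a_) not in seen:
                seen.add((s_, a_))
                counts[(s_, a_)] += 1
                Q[(s_, a_)] += (G - Q[(s_, a_)]) / counts[(s_, a_)]
        eps *= decay  # exponential exploration decay
    return Q

random.seed(0)
Q = mc_control()
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)  # the learned greedy policy moves right, toward the goal
```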

The negative rewards per episode are plotted:

*(figure: reward line plot)*

The final, near-optimal action selection by the greedy policy is shown below:

*(figure: maze policy)*

The experiment can be reproduced by running:

```
python examples.py maze_montecarlo
```

Edit the file to change the maze setup or learning parameters.


## Tuning SARSA hyperparameters with Optuna

The goal of this experiment was to tune the alpha and gamma parameters for SARSA on Maze2D. We optimized a KPI that incorporates both the final greedy policy reward and the speed at which that policy is learned. For efficiency, we terminate episodes (if necessary) after 100 steps (the optimal policy finishes in 34 steps).
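The shape of such an objective can be sketched as below. This is not the repository's code: the toy corridor stands in for Maze2D, the KPI weighting (`0.1`) and parameter ranges are illustrative assumptions, and a stdlib random search stands in for Optuna's sampler (with Optuna, the sampling lines would become `trial.suggest_float("alpha", ...)` etc. inside an objective passed to `study.optimize`).

```python
import random

# Hypothetical 1D corridor as a stand-in for Maze2D (see the MC example).
N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (0.0 if s2 == GOAL else -1.0), s2 == GOAL

def sarsa_kpi(alpha, gamma, episodes=200, eps=0.1, max_steps=100):
    """Run SARSA, then return a KPI-style score: the final greedy episode's
    total reward plus a small bonus for fast learning (mean episode reward)."""
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    def policy(s):
        if random.random() < eps:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])
    episode_rewards = []
    for _ in range(episodes):
        s, a, total, done, steps = 0, policy(0), 0.0, False, 0
        while not done and steps < max_steps:
            s2, r, done = step(s, a)
            a2 = policy(s2)
            # SARSA update: bootstrap on the action actually taken next.
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a, total, steps = s2, a2, total + r, steps + 1
        episode_rewards.append(total)
    # Evaluate the final greedy policy.
    s, done, greedy_total, steps = 0, False, 0.0, 0
    while not done and steps < max_steps:
        s, r, done = step(s, max(ACTIONS, key=lambda a: Q[(s, a)]))
        greedy_total += r
        steps += 1
    return greedy_total + 0.1 * (sum(episode_rewards) / episodes)

random.seed(1)
# Stdlib random search as a stand-in for Optuna's samplers.
best = max(
    ((random.uniform(0.05, 1.0), random.uniform(0.8, 1.0)) for _ in range(20)),
    key=lambda p: sarsa_kpi(*p),
)
print("best (alpha, gamma):", best)
```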

The plot below shows the contour plot for the tested hyperparameters:

*(figure: hyperparameter tuning contour)*

The following two figures show the rewards during learning and the final policy for the best parameters:

*(figure: reward line plot)*

It is evident that SARSA learns much faster than Monte Carlo, and its final policy is better overall:

*(figure: best policy)*

The experiment can be reproduced by running:

```
python examples.py sarsa_hyperparameter
```

## Learning Nash Equilibria

This standalone notebook explores using gradient descent to learn optimal mixed strategies for 2-person, 2-action games.

Interesting dynamics can be observed, such as cases where the Nash equilibrium is repelling (Bailey and Piliouras [2018]).

*(figure: repelling equilibria)*
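The repelling effect can be reproduced in a few lines. The sketch below, an illustrative assumption rather than the notebook's code, runs simultaneous gradient ascent on matching pennies, where player 1's expected payoff is `(2x-1)(2y-1)` for heads-probabilities `x` and `y` and player 2's is its negative; the unique Nash equilibrium is `x = y = 0.5`, and the discrete-time trajectory spirals away from it.

```python
# Simultaneous gradient ascent on mixed strategies for matching pennies,
# a 2-person, 2-action zero-sum game. x, y are the probabilities each
# player assigns to "heads"; the unique Nash equilibrium is x = y = 0.5.
def grad_ascent(x0=0.55, y0=0.5, lr=0.05, steps=200):
    x, y = x0, y0
    dists = []
    for _ in range(steps):
        # u1 = (2x-1)(2y-1), u2 = -u1; each player ascends their own payoff.
        gx = 2 * (2 * y - 1)   # d u1 / d x
        gy = -2 * (2 * x - 1)  # d u2 / d y
        # Euler step, clipped so the probabilities stay in [0, 1].
        x = min(1.0, max(0.0, x + lr * gx))
        y = min(1.0, max(0.0, y + lr * gy))
        dists.append(((x - 0.5) ** 2 + (y - 0.5) ** 2) ** 0.5)
    return dists

d = grad_ascent()
print(d[0], d[-1])  # the distance to the equilibrium grows: the NE repels
```

Each Euler step rotates the deviation from the equilibrium and stretches it by a factor above 1, so the orbit spirals outward until the probability constraints clip it.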