These implementations are based on Shiyu Zhao's book *Mathematical Foundations of Reinforcement Learning*.
This repository contains a comprehensive set of reinforcement learning algorithms for solving the Grid World environment.
- Value Iteration - iterative Bellman optimality updates (a value update plus a greedy policy update each sweep)
- Policy Iteration - full policy evaluation followed by policy improvement
- Monte Carlo Control - first-visit and every-visit variants, with exploring starts and epsilon-soft policies
- SARSA - on-policy TD control
- Q-Learning - off-policy TD control (tabular sketch below)
- Deep Q-Networks (DQN) - neural-network-based Q-learning
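To make the tabular TD methods concrete, here is a minimal Q-learning sketch. The environment interface (`env.reset()`, `env.step()` returning `(next_state, reward, done)`) and integer state encoding are assumptions for illustration, not this repository's actual API.

```python
import numpy as np

# Minimal tabular Q-learning sketch (illustrative only; the environment
# interface and state encoding are assumptions, not this repo's API).
# States are assumed to be integer indices 0..n_states-1.
def q_learning(env, n_states, n_actions, n_episodes=500, episode_len=200,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(episode_len):
            # epsilon-greedy behavior policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy TD target: greedy value of the next state
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
            if done:
                break
    return Q
```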
```
├── configs/                  # YAML configuration files
│   ├── run_dqn.yaml
│   ├── run_monte_carlo.yaml
│   ├── run_policy_iteration.yaml
│   ├── run_qlearning.yaml
│   ├── run_sarsa.yaml
│   └── run_value_iteration.yaml
├── src/                      # Core source code
│   ├── environment.py        # Grid World implementation
│   └── visualizer.py         # Visualization utilities
├── solvers/                  # RL algorithm implementations
│   ├── value_iteration.py
│   ├── policy_iteration.py
│   ├── monte_carlo.py
│   ├── temporal_difference.py
│   ├── q_learning.py
│   └── deep_q_learning.py
├── reference/                # Reference implementations
├── utils/                    # Helper functions
└── main.py                   # Main entry point
```
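For orientation, `src/environment.py` implements the Grid World. A stripped-down environment of this kind might look like the sketch below; the class and method names are hypothetical and only mirror the fields of the example config shown later (size, forbidden states, target state, reward terms).

```python
# Hypothetical Grid World sketch; not the repo's actual class.
# For tabular solvers, a (row, col) state can be flattened to row * size + col.
class GridWorld:
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # up, down, left, right, stay

    def __init__(self, size=5, initial_state=(0, 0), target_state=(3, 2),
                 forbidden_states=(), reward_target=0.0, reward_forbidden=-10.0,
                 reward_boundary=-10.0, reward_other=-1.0):
        self.size = size
        self.initial_state = tuple(initial_state)
        self.target_state = tuple(target_state)
        self.forbidden_states = {tuple(s) for s in forbidden_states}
        self.rewards = dict(target=reward_target, forbidden=reward_forbidden,
                            boundary=reward_boundary, other=reward_other)
        self.state = self.initial_state

    def reset(self):
        self.state = self.initial_state
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.state[0] + dr, self.state[1] + dc
        if not (0 <= r < self.size and 0 <= c < self.size):
            # attempted move off the grid: stay in place, boundary penalty
            return self.state, self.rewards["boundary"], False
        self.state = (r, c)
        if self.state == self.target_state:
            return self.state, self.rewards["target"], True
        if self.state in self.forbidden_states:
            return self.state, self.rewards["forbidden"], False
        return self.state, self.rewards["other"], False
```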
Run an algorithm:
```bash
# DQN
python main.py --config configs/run_dqn.yaml
```
Configure algorithms via YAML files or command line:
```yaml
# Example configs/run_qlearning.yaml
name: "q_learning_small_gamma"
algorithm: "q_learning"

# Environment params
env:
  log_history: 1
  size: 5
  initial_state: [0, 0]
  forbidden_states:
    - [1, 1]
    - [1, 2]
    - [2, 2]
    - [3, 1]
    - [4, 1]
    - [3, 3]
  target_state: [3, 2]
  reward_target: 0.0
  reward_forbidden: -10.0
  reward_boundary: -10.0
  reward_other: -1.0

qlearning_config:
  n_episodes: 500
  episode_len: 200
  epsilon_decay: "exponential"
  epsilon: 0.9
  min_epsilon: 0.05
  alpha: 0.1
```
- Modular Design: Each algorithm in separate, reusable modules
- Visualization: Real-time grid visualization with Pygame
- Extensible: Easy to add new algorithms or environments
- Configurable: YAML-based configuration for experiments
- Benchmarking: Compare different algorithms on same problems
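The YAML configs shown above could be consumed along these lines. This is a minimal sketch using PyYAML with the keys from the example config; the dispatch logic is illustrative and not necessarily how `main.py` actually works.

```python
import argparse
import yaml  # PyYAML; assumed here, not part of the listed requirements

# Illustrative config loading; mirrors the example YAML above.
def load_config(path):
    with open(path) as f:
        return yaml.safe_load(f)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    args = parser.parse_args()

    cfg = load_config(args.config)
    print(cfg["name"], cfg["algorithm"])        # e.g. q_learning
    env_cfg = cfg["env"]                        # size, forbidden_states, rewards, ...
    algo_cfg = cfg.get("qlearning_config", {})  # per-algorithm hyperparameters
    print(env_cfg["size"], algo_cfg.get("n_episodes"))
```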
Each algorithm generates:
- Convergence plots (value/policy convergence)
- Episode reward trends
- Final optimal policy visualization
- Performance metrics (steps per episode, total reward)
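An episode-reward trend like the one listed above can be produced generically with Matplotlib; the helper below is an illustrative sketch, not the repo's `visualizer.py`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Generic episode-reward plot with a moving average (illustrative sketch).
def plot_rewards(episode_rewards, window=50):
    rewards = np.asarray(episode_rewards, dtype=float)
    plt.plot(rewards, alpha=0.3, label="per-episode reward")
    if len(rewards) >= window:
        smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
        plt.plot(np.arange(window - 1, len(rewards)), smoothed,
                 label=f"{window}-episode mean")
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.legend()
    plt.show()
```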
Example: `python main.py --config configs/run_sarsa.yaml` runs tabular SARSA (no function approximation).
While training, the performance is plotted:

Progress for 4300 episodes
At the end, the final policy is shown; for SARSA this only reveals an "optimal path", since Q-values are updated only along the state-action pairs the epsilon-greedy policy actually visits:

Optimal policy learned
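For comparison with the earlier Q-learning sketch, SARSA differs only in its TD target, which uses the action actually taken in the next state. Same assumed environment interface; illustrative only.

```python
import numpy as np

# On-policy SARSA sketch (assumed interface as in the Q-learning example above).
def sarsa(env, n_states, n_actions, n_episodes=500, episode_len=200,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        for _ in range(episode_len):
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)
            # on-policy TD target: uses a_next, not the max over actions
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
            if done:
                break
    return Q
```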
- Python 3.8+
- NumPy
- Matplotlib
- PyTorch
