This is a repository about Reinforcement Learning in which two agents, a Deep Q-Network agent and an AlphaZero-style agent, learn to play bullet chess.


Reinforcement Learning (based on the book Understanding Deep Learning by Simon J.D. Prince [1])

This repository contains a project for the seminar 'Understanding Deep Learning' by Lukas Niehaus in the summer term 2025 at the University of Osnabrück.

Overview

Reinforcement Learning is a type of Machine Learning in which an agent learns to make decisions by interacting with an environment, receiving rewards for good behaviour and penalties for bad behaviour.
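
As a rough illustration, the interaction loop can be sketched like this. The reset/step interface shown here is a common convention and an assumption, not necessarily the interface of src/environment.py:

def run_episode(env, agent):
    # Generic agent-environment loop: act, observe reward, learn, repeat.
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.select_action(state)               # agent picks a move
        next_state, reward, done = env.step(action)       # environment responds
        agent.observe(state, action, reward, next_state, done)  # learn from feedback
        state = next_state
        total_reward += reward
    return total_reward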

Chess is a natural fit for reinforcement learning because:

  • it has clear rules and goals
  • every move affects future possibilities
  • it allows the agent to explore, learn from mistakes, and improve over time.

The aim of this project is to develop reinforcement learning agents for bullet chess, where each player has only 60 seconds total to complete the game. Unlike traditional chess agents that focus solely on move quality, our agents are designed to also manage time pressure, learning to make decisions that are not only strong but also fast when required.

We use a DDQN agent and a policy-value agent with AlphaZero-style training. You can play against either of them if you like.

QuickStart

To get started quickly and run the code properly, follow this guide.

First, install Git so you can clone this repository. Then decide whether you want to set up your virtual environment with venv (built into Python) or Conda (a package and environment manager from Anaconda/Miniconda).

Install Git

Download and install Git:

  • Visit the official Git website to download the latest version of Git.
  • Follow the installation instructions for your operating system.

Clone the Git Repository

  • Open a terminal or command prompt.
  • Go to the directory where you want to store everything regarding the course:
cd <directory_name>
  • Clone the Git repository:
git clone https://github.com/lahellmann/UDL-Reinforcement-Learning
  • Change into the cloned repository:
cd UDL-Reinforcement-Learning

Set Up a Virtual Environment (venv)

Download and install Python:

  • Visit the official Python website to download the latest version of Python.

  • During installation, make sure to check the option that adds Python to your system's PATH.

  • Create a virtual environment:

python -m venv venv
  • Activate the virtual environment:
    On Windows:
.\venv\Scripts\activate
    On Unix or macOS:
source venv/bin/activate
  • Install required packages
pip install -r requirements.txt

Set Up a Virtual Environment (conda)

  • Create a virtual environment:
  1. Open your terminal (Command Prompt on Windows, Terminal on macOS/Linux).
  2. Navigate to the directory where you saved the environment.yml file. (This should be YOUR_PATH/UDL-Reinforcement-Learning/)
  3. Execute the following command to create the environment:
conda env create -n venv -f environment.yml
  • Activate the virtual environment (same command on Windows, Unix, and macOS):
conda activate venv
  • Install required packages
pip install -r requirements.txt

Structure of this Repository

.
├── .gitignore
├── environment.yml
├── main.ipynb
├── README.md
├── requirements.txt
├── models/
└── src/
    ├── agent_dqn.py
    ├── agent_policy_value.py
    ├── ddqn_agent.py
    ├── debug.py
    ├── environment.py
    ├── play.py
    ├── policyvalue_agent.py
    ├── training.py
    └── utils.py

  • models/
    Contains our trained model files

  • src/
    Contains core source code files for agents, environment, training, and utilities.

The Two Agents

DDQN (Double Deep Q-Network)

DDQN is part of the family of value-based reinforcement learning algorithms. It extends the classical Deep Q-Network (DQN) by addressing the problem of overestimation in action-value estimates. In standard DQN, the same network is used to both select and evaluate the best action, which can lead to optimistic value estimates. DDQN mitigates this by decoupling action selection from action evaluation.

Core components:

  • Online Q-network: predicts Q-values for the current state and is updated every training step.
  • Target Q-network: a delayed copy of the online network used to evaluate the Q-value of the next state, updated less frequently (e.g. every few hundred steps) to stabilize learning.
  • Replay buffer: stores experiences of the form (state, action, reward, next state, done) to break temporal correlations and allow for more sample-efficient learning.
  • Epsilon-greedy policy: balances exploration and exploitation during training by sometimes selecting random actions.
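
The sketch below shows what a replay buffer and an epsilon-greedy action selector might look like in PyTorch. It is illustrative only: the class names, tensor shapes, and the restriction of the argmax to legal moves are assumptions and do not necessarily mirror the code in src/.

import random
from collections import deque

import torch


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation between consecutive moves.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states),
                torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_network, state, legal_actions, epsilon):
    """Pick a random legal action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0)).squeeze(0)
    # Restrict the argmax to legal moves only (illegal chess moves get no Q-value).
    return max(legal_actions, key=lambda a: q_values[a].item())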

Basic idea:

1. Start an episode at initial state s
2. Choose an action a using an ε-greedy policy derived from the online Q-network
3. Take action a in the environment
4. Observe reward r and next state s′
5. Choose next action a′ using the online network, but evaluate it using the target Q-network
6. Compute target value:
    y = r + γ · Q_target(s′, argmax_a Q_online(s′, a))
7. Update Q-network by minimizing the loss between current Q-value and target y
8. Set s ← s′, a ← a′, and repeat until the episode ends
9. Periodically, copy the online network weights into the target network

Continue training until a convergence threshold is reached or a set number of episodes has been completed.
The policy is derived from the trained Q-values:
π(s) = argmax_a Q(s, a)
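
A minimal sketch of one update step following the target formula above, written in PyTorch. The optimizer, hyperparameters, and network interface (a network mapping a batch of states to Q-values per action) are assumptions, not the repository's exact implementation:

import torch
import torch.nn.functional as F


def ddqn_update(online_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) predicted by the online network for the actions actually taken.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Select the next action with the online network ...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ... but evaluate it with the target network (the DDQN decoupling).
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Minimize the gap between the current Q-value and the target y.
    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def sync_target(online_net, target_net):
    # Periodically copy the online weights into the target network (step 9 above).
    target_net.load_state_dict(online_net.state_dict())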

Policy-Value with AlphaZero-style training

This agent is based on AlphaZero, a powerful reinforcement learning framework that combines policy and value learning with Monte Carlo Tree Search (MCTS). Unlike DDQN, which estimates action-values directly, AlphaZero learns a policy (which actions to prefer) and a value function (how good a position is) using a shared neural network.

Instead of learning purely from rewards after taking actions, the agent improves itself through self-play guided by MCTS, which explores and refines the policy using simulations from the neural network [2].

Core components:

  • Policy-Value Neural Network (see the sketch after this list): Given a state (e.g. a chess board), it outputs:
    • A policy (π) — a probability distribution over legal actions
    • A value (v) — the expected game outcome from the current position
  • Monte Carlo Tree Search:
    • Uses the policy and value predictions to guide exploration of possible future states.
    • Outputs improved target policy (based on visit counts) and simulated value.
  • Replay buffer: stores (state, π, value) triplets collected during self-play games.
  • Training loop: alternates between self-play and training using stored experiences.
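
Below is a minimal policy-value network sketch with one shared body and two heads. The board encoding (12 feature planes over an 8×8 board), layer sizes, and action-space size are illustrative assumptions, not the repository's actual architecture:

import torch
import torch.nn as nn


class PolicyValueNet(nn.Module):
    """Shared convolutional body with a policy head and a value head."""

    def __init__(self, n_planes=12, n_actions=4672):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.policy_head = nn.Linear(64 * 8 * 8, n_actions)       # logits over moves
        self.value_head = nn.Sequential(
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),                          # value in [-1, 1]
        )

    def forward(self, board_planes):
        # board_planes: (batch, n_planes, 8, 8) encoding of the position
        x = self.body(board_planes)
        return self.policy_head(x), self.value_head(x).squeeze(-1)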

Basic idea:

1. Start an episode (a self-play game) at initial state s
2. At each move:
   a. Use Monte Carlo Tree Search (MCTS) guided by the current policy-value network
   b. MCTS returns a probability distribution π over legal actions
   c. Choose action a by sampling from π (adds exploration)
   d. Record the training tuple (state s, policy π, current player)
   e. Take action a → observe next state s′
   f. Set s ← s′, and continue until the game ends
3. After the game ends, assign the game result z (+1 win, −1 loss, 0 draw)
4. For each stored tuple (s, π, player), compute value = z * player_sign and add (s, π, value) to the replay buffer
5. After collecting enough games:
   a. Sample mini-batches from the buffer
   b. Perform gradient descent to minimize the loss:
      Policy loss: cross-entropy between predicted and target π
      Value loss: mean squared error between predicted and true value
6. Repeat the self-play → training cycle

Continue until the policy converges or a preset training limit is reached.
The learned policy is given directly by the network’s output distribution over actions.
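
The training step can be sketched as the sum of a cross-entropy policy loss and a mean-squared-error value loss; the helper that turns per-move records into training triples mirrors the value = z * player_sign rule above. Function names and the batch layout are assumptions:

import torch
import torch.nn.functional as F


def alphazero_loss(net, states, target_pis, target_values):
    # net follows the PolicyValueNet interface sketched above:
    # logits over actions and a scalar value per position.
    logits, values = net(states)
    log_probs = F.log_softmax(logits, dim=1)
    policy_loss = -(target_pis * log_probs).sum(dim=1).mean()   # cross-entropy vs. MCTS policy
    value_loss = F.mse_loss(values, target_values)              # MSE vs. game outcome z
    return policy_loss + value_loss


def assign_values(game_records, z):
    """Turn per-move records (state, pi, player_sign) into (state, pi, value) triples.

    z is the final game result (+1 win, -1 loss, 0 draw); each stored position's
    value is z adjusted to the player to move (value = z * player_sign).
    """
    return [(state, pi, z * player_sign) for state, pi, player_sign in game_records]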

References

[1] S. J. D. Prince, Understanding Deep Learning. The MIT Press, 2023.
[2] D. Silver, J. Schrittwieser, K. Simonyan, et al., "Mastering the game of Go without human knowledge," Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270
