Commit

initial
rwachters committed Jan 13, 2022
0 parents commit 04ec5ea
Showing 25 changed files with 1,945 additions and 0 deletions.
8 changes: 8 additions & 0 deletions .idea/.gitignore
16 changes: 16 additions & 0 deletions .idea/Unity ML Agents - Python API - Examples.iml
31 changes: 31 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml
6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml
4 changes: 4 additions & 0 deletions .idea/misc.xml
8 changes: 8 additions & 0 deletions .idea/modules.xml
6 changes: 6 additions & 0 deletions .idea/other.xml
6 changes: 6 additions & 0 deletions .idea/vcs.xml

107 changes: 107 additions & 0 deletions README.md
@@ -0,0 +1,107 @@
[//]: # (Image References)

[image1]: https://user-images.githubusercontent.com/10624937/42386929-76f671f0-8106-11e8-9376-f17da2ae852e.png "Kernel"
# Reinforcement Learning Project

This project was created to make it easier to get started with Reinforcement Learning. It now contains:
- An implementation of the [DDPG Algorithm](https://arxiv.org/abs/1509.02971) in Python, which works for both single-agent and multi-agent environments.
- Single and parallel environments in [Unity ML-Agents](https://unity.com/products/machine-learning-agents) using the [Python API](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Python-API.md).
- Two Jupyter notebooks:
- [3DBall.ipynb](notebooks/3DBall.ipynb): This is a simple example to get started with Unity ML Agents & the DDPG Algorithm.
- [3DBall_parallel_environment.ipynb](notebooks/3DBall_parallel_environment.ipynb): The same, but now for an environment run in parallel.

# Getting Started

## Install Basic Dependencies

To set up your Python environment to run the code in the notebooks, follow the instructions below.

- If you're on Windows I recommend installing [Miniforge](https://github.com/conda-forge/miniforge). It's a minimal installer for Conda. I also recommend using the [Mamba](https://github.com/mamba-org/mamba) package manager instead of [Conda](https://docs.conda.io/). It works almost the same as Conda, but faster. There's a [cheatsheet](https://docs.conda.io/projects/conda/en/latest/user-guide/cheatsheet.html) of Conda commands which also work in Mamba. To install Mamba, use this command:
  ```bash
  conda install mamba -n base -c conda-forge
  ```
- Create (and activate) a new environment with Python 3.6 or later. I recommend using Python 3.9:

  - __Linux__ or __Mac__:
    ```bash
    mamba create --name rl39 python=3.9 numpy
    source activate rl39
    ```
  - __Windows__:
    ```bash
    mamba create --name rl39 python=3.9 numpy
    activate rl39
    ```
- Install PyTorch by following the instructions on [Pytorch.org](https://pytorch.org/). For example, to install PyTorch on Windows with GPU support, use this command:

  ```bash
  mamba install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
  ```
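  Optionally, you can quickly check whether this PyTorch build can see your GPU:

  ```python
  import torch

  # Prints the installed PyTorch version and True if a CUDA-capable GPU is visible.
  print(torch.__version__, torch.cuda.is_available())
  ```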

- Install additional packages:
  ```bash
  mamba install jupyter notebook matplotlib
  ```

- Create an [IPython kernel](http://ipython.readthedocs.io/en/stable/install/kernel_install.html) for the `rl39` environment in Jupyter.

  ```bash
  python -m ipykernel install --user --name rl39 --display-name "rl39"
  ```

- Change the kernel to match the `rl39` environment by using the drop-down menu `Kernel` -> `Change kernel` inside Jupyter Notebook.

## Install Unity Machine Learning Agents

**Note**:
In order to run the notebooks on **Windows**, it's not necessary to install the Unity Editor, because I have provided the [standalone executables](notebooks/README.md) of the environments for you.

[Unity ML Agents](https://unity.com/products/machine-learning-agents) is the software that we use for the environments. The agents that we create in Python can interact with these environments. Unity ML Agents consists of several parts:
- [The Unity Editor](https://unity.com/) is used for creating environments. To install it:
  - Install [Unity Hub](https://unity.com/download).
  - Install the latest version of Unity by clicking the green `Unity Hub` button on the [download page](https://unity3d.com/get-unity/download/archive).

  To start the Unity Editor you must first have a project:
  - Start the Unity Hub.
  - Click on "Projects".
  - Create a new dummy project.
  - Click on the project you've just added in the Unity Hub. The Unity Editor should start now.

- [The Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents#unity-ml-agents-toolkit). Download [the latest release](https://github.com/Unity-Technologies/ml-agents/releases) of the source code or use the [Git](https://git-scm.com/downloads/guis) command: `git clone --branch release_18 https://github.com/Unity-Technologies/ml-agents.git`.
- The Unity ML Agents package is used inside the Unity Editor. Please read [the instructions for installation](https://github.com/Unity-Technologies/ml-agents/blob/release_18_docs/docs/Installation.md#install-the-comunityml-agents-unity-package).
- The `mlagents` Python package is used as a bridge between Python and the Unity Editor (or a standalone executable). To install it, use this command: `python -m pip install mlagents==0.27.0`. Note that there is no conda package available for it. A minimal connection sketch follows this list.
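
Once `mlagents` is installed and you have a standalone executable of an environment (see the build instructions below), connecting to it from Python looks roughly like the sketch below. The `file_name` value is just an example and depends on where you put your build:

```python
import numpy as np
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple

# The file name is an example; point it at your own 3DBall build.
env = UnityEnvironment(file_name="3DBall", no_graphics=True)
env.reset()

# Each Unity "behavior" groups agents that share the same observation/action spec.
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for _ in range(10):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    # One random continuous action per agent that requested a decision.
    random_actions = np.random.uniform(
        -1.0, 1.0, (len(decision_steps), spec.action_spec.continuous_size)).astype(np.float32)
    env.set_actions(behavior_name, ActionTuple(continuous=random_actions))
    env.step()

env.close()
```
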
## Install an IDE for Python
For Windows, I recommend [PyCharm](https://www.jetbrains.com/pycharm/) (my choice) or [Visual Studio Code](https://code.visualstudio.com/). Inside either IDE you can use the Conda environment you've just created.
## Creating a custom Unity executable
### Load the examples project
[The Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents#unity-ml-agents-toolkit) contains several [example environments](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md). Here we will load them all inside the Unity editor:
- Start the Unity Hub.
- Click on "Projects"
- Add a project by navigating to the `Project` folder inside the toolkit.
- Click on the project you've just added in the Unity Hub. The Unity Editor should start now.

### Create a 3D Ball executable
The 3D Ball example contains 12 copies of the environment in a single scene, which doesn't work well with the Python API: there is no way to reset each copy individually. Therefore, we will remove the other 11 copies in the editor:
- Load the 3D Ball scene by going to the Project window and navigating to `Examples` -> `3DBall` -> `Scenes` -> `3DBall`.
- In the Hierarchy window, select the other 11 3DBall objects and delete them, so that only the `3DBall` object remains.

Next, we will build the executable:
- Go to `File` -> `Build Settings`
- In the Build Settings window, click `Build`
- Navigate to the `notebooks` folder and append `3DBall` to the folder name used for the build.


## Instructions for running the notebooks

1. [Download](notebooks/README.md) the Unity executables for Windows. If you're not on Windows, build the executables yourself by following the instructions above.
2. Place the Unity executable folders in the same folder as the notebooks.
3. Open a notebook in Jupyter Notebook (start it with the command `jupyter notebook`).
4. Follow further instructions in the notebook.
37 changes: 37 additions & 0 deletions Report.md
@@ -0,0 +1,37 @@
[//]: # (Image References)

[image1]: ./plot.png

# Project 3: Collaboration and Competition
## Learning Algorithm
The learning algorithm used for this project is [Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/abs/1509.02971). DDPG is an Actor-Critic method that works with continuous action spaces. Just like DQN (from project 1), it uses [Experience Replay](https://paperswithcode.com/method/experience-replay) and a [Target Network](https://towardsdatascience.com/deep-q-network-dqn-ii-b6bf911b6b2c). The Actor learns a deterministic policy, and the Critic learns a Q-value function; the two interact during learning, with the Critic using the Actor's deterministic action when calculating the Q-value. Because the policy is deterministic, noise must be added to the action values to help with exploration. This implementation uses noise decay, so that the noise is high at the start of the learning process and much lower at the end.
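
The noise code itself is not shown in this report. As a rough sketch of the idea (the linear decay schedule, uniform noise, and clipping range below are assumptions, not necessarily what the repository does):

```python
import torch


def noisy_action(action: torch.Tensor, step: int,
                 noise_scale: float = 1.0, noise_decay: float = 3e-6) -> torch.Tensor:
    """Add exploration noise whose magnitude shrinks as training progresses."""
    # Linear decay is an assumption; the repository may use a different schedule.
    scale = noise_scale * max(0.0, 1.0 - noise_decay * step)
    noise = (torch.rand_like(action) - 0.5) * scale  # zero-mean uniform noise
    return torch.clamp(action + noise, -1.0, 1.0)
```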

Two types of neural networks are used in this project, one for the Actor and one for the Critic. Both have two hidden layers with 256 and 128 linear units. The Actor network has 24 inputs and 2 outputs, because each state has 24 dimensions and each action has 2. The Critic has 26 (24 + 2) inputs and a single output, the Q-value.
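
For illustration, a minimal pair of networks matching the sizes above could look like the sketch below. `model.py` is not part of this excerpt, so the constructor signature, layer names, and activation functions (ReLU, Tanh) are assumptions:

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Maps a 24-dimensional state to 2 continuous actions in [-1, 1]."""

    def __init__(self, state_size: int = 24, action_size: int = 2):
        super().__init__()
        self.action_size = action_size  # DDPGAgent reads this attribute
        self.net = nn.Sequential(
            nn.Linear(state_size, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, action_size), nn.Tanh())

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class Critic(nn.Module):
    """Maps a (state, action) pair (24 + 2 inputs) to a single Q-value."""

    def __init__(self, state_size: int = 24, action_size: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size + action_size, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat((state, action), dim=-1))
```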

In this project there are two agents, so there is an Actor and a Critic neural network for each agent. Both agents learn independently of each other. Each Critic only uses the state observed by its own agent, not the global state as in the [MADDPG](https://proceedings.neurips.cc/paper/2017/file/68a9750337a418a86fe06c1991a1d64c-Paper.pdf) algorithm.

The hyperparameters used for this algorithm are as follows (a usage sketch comes after the list):

- `buffer_size=100000` replay buffer size
- `batch_size=1000` minibatch size
- `gamma=0.99` discount factor
- `tau=1e-3` for soft update of the target network parameters
- `lr_actor=1e-4` learning rate of the actor
- `lr_critic=1e-3` learning rate of the critic
- `weight_decay=0.0` L2 weight decay
- `update_every=20` how often to update the networks
- `noise_decay=3e-6` the noise decay used for the action values
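
The first seven values map onto the `DDPGAgent` constructor from `ddpg_agent.py` (shown later in this commit); `buffer_size`, `batch_size`, `update_every`, and `noise_decay` are presumably consumed by the replay buffer and training loop, which are not part of this excerpt. A hedged instantiation sketch, where the `Actor`/`Critic` constructor arguments are assumptions:

```python
from model import Actor, Critic
from ddpg_agent import DDPGAgent

# state_size/action_size arguments are assumptions; model.py is not shown here.
actor = Actor(state_size=24, action_size=2)
critic = Critic(state_size=24, action_size=2)

agent = DDPGAgent(actor, critic,
                  gamma=0.99, tau=1e-3,
                  lr_actor=1e-4, lr_critic=1e-3,
                  weight_decay=0.0)
```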

## Plot of Rewards
![plot][image1]

The environment was solved in 23746 episodes.

## Ideas for Future Work
The performance of the agent could be improved in several ways:

- [MADDPG](https://proceedings.neurips.cc/paper/2017/file/68a9750337a418a86fe06c1991a1d64c-Paper.pdf)
- [Twin Delayed DDPG](https://spinningup.openai.com/en/latest/algorithms/td3.html)
- [Soft Actor Critic (SAC)](https://spinningup.openai.com/en/latest/algorithms/sac.html)
- [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)

115 changes: 115 additions & 0 deletions ddpg_agent.py
@@ -0,0 +1,115 @@
from model import Actor, Critic
from pytorch_device import pytorch_device
import torch
import torch.nn.functional as f
import torch.optim as optim
from typing import Tuple, List
import copy


class DDPGAgent:
    """Interacts with and learns from the environment."""

    def __init__(self, actor: Actor, critic: Critic, gamma=0.99, tau=1e-3,
                 lr_actor=1e-4, lr_critic=1e-3, weight_decay=1e-2):
        """Initialize a DDPG Agent object.

        :param actor:
        :param critic:
        :param gamma: discount factor
        :param tau: for soft update of target parameters
        :param lr_actor: learning rate of the actor
        :param lr_critic: learning rate of the critic
        :param weight_decay: L2 weight decay
        """
        self.action_size = actor.action_size
        self.gamma = gamma
        self.tau = tau

        # Actor Network (w/ Target Network)
        self.actor = actor.to(pytorch_device)
        self.actor_target = copy.deepcopy(actor).to(pytorch_device)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr_actor)

        # Critic Network (w/ Target Network)
        self.critic = critic.to(pytorch_device)
        self.critic_target = copy.deepcopy(critic).to(pytorch_device)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr_critic, weight_decay=weight_decay)

    def act(self, state) -> torch.Tensor:
        self.actor.eval()
        with torch.no_grad():
            action = self.actor(state)
        self.actor.train()
        return action

    def step(self, samples: Tuple[torch.Tensor, ...]):
        """Update policy and value parameters using given batch of experience tuples.

        Q_targets = r + γ * critic_target(next_state, actor_target(next_state))
        where:
            actor_target(state) -> action
            critic_target(state, action) -> Q-value

        :param samples: tuple of (s, a, r, s', done)
        """
        states, actions, rewards, next_states, dones = samples

        # ---------------------------- update critic ---------------------------- #
        with torch.no_grad():
            # Get predicted next-state actions and Q values from target models
            actions_next = self.actor_target(next_states)  # + \
            # (torch.rand(*actions.shape, device=pytorch_device) * 0.1 - 0.05)
            # torch.clamp_(actions_next, min=-1.0, max=1.0)
            q_targets_next = self.critic_target(next_states, actions_next)
            # Compute Q targets for current states
            q_targets = rewards + (self.gamma * q_targets_next * (1 - dones))
        # Compute critic loss
        q_expected = self.critic(states, actions)
        critic_loss = f.mse_loss(q_expected, q_targets)
        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        # torch.nn.utils.clip_grad_norm_(self.critic.parameters(), 1)
        self.critic_optimizer.step()

        # ---------------------------- update actor ---------------------------- #
        # Compute actor loss
        actions_pred = self.actor(states)  # + \
        # (torch.rand(*actions.shape, device=pytorch_device) * 0.1 - 0.05)
        # torch.clamp_(actions_pred, min=-1.0, max=1.0)
        actor_loss = -self.critic(states, actions_pred).mean()
        # Minimize the loss
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        # torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 1)
        self.actor_optimizer.step()

    def update_target_networks(self):
        soft_update(self.critic, self.critic_target, self.tau)
        soft_update(self.actor, self.actor_target, self.tau)

    def get_state_dicts(self):
        return {'actor_params': self.actor.state_dict(),
                'actor_optim_params': self.actor_optimizer.state_dict(),
                'critic_params': self.critic.state_dict(),
                'critic_optim_params': self.critic_optimizer.state_dict()}

    def load_state_dicts(self, state_dicts):
        self.actor.load_state_dict(state_dicts['actor_params'])
        self.actor_optimizer.load_state_dict(state_dicts['actor_optim_params'])
        self.critic.load_state_dict(state_dicts['critic_params'])
        self.critic_optimizer.load_state_dict(state_dicts['critic_optim_params'])


def soft_update(local_model, target_model, tau):
    """Soft update model parameters.
    θ_target = τ*θ_local + (1 - τ)*θ_target

    Params
    ======
        local_model: PyTorch model (weights will be copied from)
        target_model: PyTorch model (weights will be copied to)
        tau (float): interpolation parameter
    """
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)