Commit

initial
rwachters committed Jan 13, 2022
0 parents commit 04ec5ea
Showing 25 changed files with 1,945 additions and 0 deletions.
8 changes: 8 additions & 0 deletions .idea/.gitignore
16 changes: 16 additions & 0 deletions .idea/Unity ML Agents - Python API - Examples.iml
31 changes: 31 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml
6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/profiles_settings.xml
4 changes: 4 additions & 0 deletions .idea/misc.xml
8 changes: 8 additions & 0 deletions .idea/modules.xml
6 changes: 6 additions & 0 deletions .idea/other.xml
6 changes: 6 additions & 0 deletions .idea/vcs.xml

107 changes: 107 additions & 0 deletions README.md
@@ -0,0 +1,107 @@
[//]: # (Image References)

[image1]: https://user-images.githubusercontent.com/10624937/42386929-76f671f0-8106-11e8-9376-f17da2ae852e.png "Kernel"
# Reinforcement Learning Project

This project was created to make it easier to get started with Reinforcement Learning. It now contains:
- An implementation of the [DDPG Algorithm](https://arxiv.org/abs/1509.02971) in Python, which works for both single-agent and multi-agent environments.
- Single and parallel environments in [Unity ML-Agents](https://unity.com/products/machine-learning-agents) using the [Python API](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Python-API.md).
- Two Jupyter notebooks:
- [3DBall.ipynb](notebooks/3DBall.ipynb): This is a simple example to get started with Unity ML Agents & the DDPG Algorithm.
- [3DBall_parallel_environment.ipynb](notebooks/3DBall_parallel_environment.ipynb): The same, but now for an environment run in parallel.

# Getting Started

## Install Basic Dependencies

To set up your Python environment to run the code in the notebooks, follow the instructions below.

- If you're on Windows I recommend installing [Miniforge](https://github.com/conda-forge/miniforge). It's a minimal installer for Conda. I also recommend using the [Mamba](https://github.com/mamba-org/mamba) package manager instead of [Conda](https://docs.conda.io/). It works almost the same as Conda, but faster. There's a [cheatsheet](https://docs.conda.io/projects/conda/en/latest/user-guide/cheatsheet.html) of Conda commands which also work in Mamba. To install Mamba, use this command:
  ```bash
  conda install mamba -n base -c conda-forge
  ```
- Create (and activate) a new environment with Python 3.6 or later. I recommend using Python 3.9:

  - __Linux__ or __Mac__:
    ```bash
    mamba create --name rl39 python=3.9 numpy
    source activate rl39
    ```
  - __Windows__:
    ```bash
    mamba create --name rl39 python=3.9 numpy
    activate rl39
    ```
- Install PyTorch by following the instructions on [Pytorch.org](https://pytorch.org/). For example, to install PyTorch on Windows with GPU support, use this command:

  ```bash
  mamba install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
  ```
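  Optionally, you can quickly check whether this PyTorch build can see your GPU:

  ```python
  import torch

  # Prints the installed PyTorch version and True if a CUDA-capable GPU is visible.
  print(torch.__version__, torch.cuda.is_available())
  ```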

- Install additional packages:
  ```bash
  mamba install jupyter notebook matplotlib
  ```

- Create an [IPython kernel](http://ipython.readthedocs.io/en/stable/install/kernel_install.html) for the `rl39` environment in Jupyter.

  ```bash
  python -m ipykernel install --user --name rl39 --display-name "rl39"
  ```

- Change the kernel to match the `rl39` environment by using the drop-down menu `Kernel` -> `Change kernel` inside Jupyter Notebook.

## Install Unity Machine Learning Agents

**Note**:
In order to run the notebooks on **Windows**, it's not necessary to install the Unity Editor, because I have provided the [standalone executables](notebooks/README.md) of the environments for you.

[Unity ML Agents](https://unity.com/products/machine-learning-agents) is the software that we use for the environments. The agents that we create in Python can interact with these environments. Unity ML Agents consists of several parts:
- [The Unity Editor](https://unity.com/) is used for creating environments. To install it:
  - Install [Unity Hub](https://unity.com/download).
  - Install the latest version of Unity by clicking the green `Unity Hub` button on the [download page](https://unity3d.com/get-unity/download/archive).

  To start the Unity Editor you must first have a project:
  - Start the Unity Hub.
  - Click on "Projects".
  - Create a new dummy project.
  - Click on the project you've just added in the Unity Hub. The Unity Editor should start now.

- [The Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents#unity-ml-agents-toolkit). Download [the latest release](https://github.com/Unity-Technologies/ml-agents/releases) of the source code or use the [Git](https://git-scm.com/downloads/guis) command: `git clone --branch release_18 https://github.com/Unity-Technologies/ml-agents.git`.
- The Unity ML Agents package is used inside the Unity Editor. Please read [the instructions for installation](https://github.com/Unity-Technologies/ml-agents/blob/release_18_docs/docs/Installation.md#install-the-comunityml-agents-unity-package).
- The `mlagents` Python package is used as a bridge between Python and the Unity Editor (or a standalone executable). To install it, use this command: `python -m pip install mlagents==0.27.0`. Note that there is no conda package available for it. A minimal connection sketch follows this list.
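
Once `mlagents` is installed and you have a standalone executable of an environment (see the build instructions below), connecting to it from Python looks roughly like the sketch below. The `file_name` value is just an example and depends on where you put your build:

```python
import numpy as np
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple

# The file name is an example; point it at your own 3DBall build.
env = UnityEnvironment(file_name="3DBall", no_graphics=True)
env.reset()

# Each Unity "behavior" groups agents that share the same observation/action spec.
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for _ in range(10):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    # One random continuous action per agent that requested a decision.
    random_actions = np.random.uniform(
        -1.0, 1.0, (len(decision_steps), spec.action_spec.continuous_size)).astype(np.float32)
    env.set_actions(behavior_name, ActionTuple(continuous=random_actions))
    env.step()

env.close()
```
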
## Install an IDE for Python
For Windows, I recommend [PyCharm](https://www.jetbrains.com/pycharm/) (my choice) or [Visual Studio Code](https://code.visualstudio.com/). Inside either IDE you can use the Conda environment you've just created.
## Creating a custom Unity executable
### Load the examples project
[The Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents#unity-ml-agents-toolkit) contains several [example environments](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md). Here we will load them all inside the Unity editor:
- Start the Unity Hub.
- Click on "Projects"
- Add a project by navigating to the `Project` folder inside the toolkit.
- Click on the project you've just added in the Unity Hub. The Unity Editor should start now.

### Create a 3D Ball executable
The 3D Ball example contains 12 copies of the environment in a single scene, which doesn't work well with the Python API: there is no way to reset each copy individually. Therefore, we will remove the other 11 copies in the editor:
- Load the 3D Ball scene by going to the Project window and navigating to `Examples` -> `3DBall` -> `Scenes` -> `3DBall`.
- In the Hierarchy window, select the other 11 3DBall objects and delete them, so that only the `3DBall` object remains.

Next, we will build the executable:
- Go to `File` -> `Build Settings`
- In the Build Settings window, click `Build`
- Navigate to the `notebooks` folder and append `3DBall` to the folder name used for the build.


## Instructions for running the notebooks

1. [Download](notebooks/README.md) the Unity executables for Windows. If you're not on Windows, build the executables yourself by following the instructions above.
2. Place the Unity executable folders in the same folder as the notebooks.
3. Open a notebook in Jupyter Notebook (start it with the command `jupyter notebook`).
4. Follow further instructions in the notebook.
37 changes: 37 additions & 0 deletions Report.md
@@ -0,0 +1,37 @@
[//]: # (Image References)

[image1]: ./plot.png

# Project 3: Collaboration and Competition
## Learning Algorithm
The learning algorithm used for this project is [Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/abs/1509.02971). DDPG is an Actor-Critic method that works with continuous action spaces. Just like DQN (from project 1), it uses [Experience Replay](https://paperswithcode.com/method/experience-replay) and a [Target Network](https://towardsdatascience.com/deep-q-network-dqn-ii-b6bf911b6b2c). The Actor learns a deterministic policy, and the Critic learns a Q-value function; the two interact during learning, with the Critic using the Actor's deterministic action when calculating the Q-value. Because the policy is deterministic, noise must be added to the action values to help with exploration. This implementation uses noise decay, so that the noise is high at the start of the learning process and much lower at the end.
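
The noise code itself is not shown in this report. As a rough sketch of the idea (the linear decay schedule, uniform noise, and clipping range below are assumptions, not necessarily what the repository does):

```python
import torch


def noisy_action(action: torch.Tensor, step: int,
                 noise_scale: float = 1.0, noise_decay: float = 3e-6) -> torch.Tensor:
    """Add exploration noise whose magnitude shrinks as training progresses."""
    # Linear decay is an assumption; the repository may use a different schedule.
    scale = noise_scale * max(0.0, 1.0 - noise_decay * step)
    noise = (torch.rand_like(action) - 0.5) * scale  # zero-mean uniform noise
    return torch.clamp(action + noise, -1.0, 1.0)
```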

Two types of neural networks are used in this project, one for the Actor and one for the Critic. Both have two hidden layers with 256 and 128 linear units. The Actor network has 24 inputs and 2 outputs, because each state has 24 dimensions and each action has 2. The Critic has 26 (24 + 2) inputs and a single output, the Q-value.
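
For illustration, a minimal pair of networks matching the sizes above could look like the sketch below. `model.py` is not part of this excerpt, so the constructor signature, layer names, and activation functions (ReLU, Tanh) are assumptions:

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Maps a 24-dimensional state to 2 continuous actions in [-1, 1]."""

    def __init__(self, state_size: int = 24, action_size: int = 2):
        super().__init__()
        self.action_size = action_size  # DDPGAgent reads this attribute
        self.net = nn.Sequential(
            nn.Linear(state_size, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, action_size), nn.Tanh())

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class Critic(nn.Module):
    """Maps a (state, action) pair (24 + 2 inputs) to a single Q-value."""

    def __init__(self, state_size: int = 24, action_size: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size + action_size, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat((state, action), dim=-1))
```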

In this project there are two agents, so there is an Actor and a Critic neural network for each agent. Both agents learn independently of each other. Each Critic only uses the state observed by its own agent, not the global state as in the [MADDPG](https://proceedings.neurips.cc/paper/2017/file/68a9750337a418a86fe06c1991a1d64c-Paper.pdf) algorithm.

The hyperparameters used for this algorithm are as follows (a usage sketch comes after the list):

- `buffer_size=100000` replay buffer size
- `batch_size=1000` minibatch size
- `gamma=0.99` discount factor
- `tau=1e-3` for soft update of the target network parameters
- `lr_actor=1e-4` learning rate of the actor
- `lr_critic=1e-3` learning rate of the critic
- `weight_decay=0.0` L2 weight decay
- `update_every=20` how often to update the networks
- `noise_decay=3e-6` the noise decay used for the action values
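
The first seven values map onto the `DDPGAgent` constructor from `ddpg_agent.py` (shown later in this commit); `buffer_size`, `batch_size`, `update_every`, and `noise_decay` are presumably consumed by the replay buffer and training loop, which are not part of this excerpt. A hedged instantiation sketch, where the `Actor`/`Critic` constructor arguments are assumptions:

```python
from model import Actor, Critic
from ddpg_agent import DDPGAgent

# state_size/action_size arguments are assumptions; model.py is not shown here.
actor = Actor(state_size=24, action_size=2)
critic = Critic(state_size=24, action_size=2)

agent = DDPGAgent(actor, critic,
                  gamma=0.99, tau=1e-3,
                  lr_actor=1e-4, lr_critic=1e-3,
                  weight_decay=0.0)
```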

## Plot of Rewards
![plot][image1]

The environment was solved in 23746 episodes.

## Ideas for Future Work
The performance of the agent could be improved in several ways:

- [MADDPG](https://proceedings.neurips.cc/paper/2017/file/68a9750337a418a86fe06c1991a1d64c-Paper.pdf)
- [Twin Delayed DDPG](https://spinningup.openai.com/en/latest/algorithms/td3.html)
- [Soft Actor Critic (SAC)](https://spinningup.openai.com/en/latest/algorithms/sac.html)
- [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)

115 changes: 115 additions & 0 deletions ddpg_agent.py
@@ -0,0 +1,115 @@
from model import Actor, Critic
from pytorch_device import pytorch_device
import torch
import torch.nn.functional as f
import torch.optim as optim
from typing import Tuple, List
import copy


class DDPGAgent:
    """Interacts with and learns from the environment."""

    def __init__(self, actor: Actor, critic: Critic, gamma=0.99, tau=1e-3,
                 lr_actor=1e-4, lr_critic=1e-3, weight_decay=1e-2):
        """Initialize a DDPG Agent object.

        :param actor:
        :param critic:
        :param gamma: discount factor
        :param tau: for soft update of target parameters
        :param lr_actor: learning rate of the actor
        :param lr_critic: learning rate of the critic
        :param weight_decay: L2 weight decay
        """
        self.action_size = actor.action_size
        self.gamma = gamma
        self.tau = tau

        # Actor Network (w/ Target Network)
        self.actor = actor.to(pytorch_device)
        self.actor_target = copy.deepcopy(actor).to(pytorch_device)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr_actor)

        # Critic Network (w/ Target Network)
        self.critic = critic.to(pytorch_device)
        self.critic_target = copy.deepcopy(critic).to(pytorch_device)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr_critic, weight_decay=weight_decay)

    def act(self, state) -> torch.Tensor:
        self.actor.eval()
        with torch.no_grad():
            action = self.actor(state)
        self.actor.train()
        return action

    def step(self, samples: Tuple[torch.Tensor, ...]):
        """Update policy and value parameters using given batch of experience tuples.

        Q_targets = r + γ * critic_target(next_state, actor_target(next_state))
        where:
            actor_target(state) -> action
            critic_target(state, action) -> Q-value

        :param samples: tuple of (s, a, r, s', done)
        """
        states, actions, rewards, next_states, dones = samples

        # ---------------------------- update critic ---------------------------- #
        with torch.no_grad():
            # Get predicted next-state actions and Q values from target models
            actions_next = self.actor_target(next_states)  # + \
            # (torch.rand(*actions.shape, device=pytorch_device) * 0.1 - 0.05)
            # torch.clamp_(actions_next, min=-1.0, max=1.0)
            q_targets_next = self.critic_target(next_states, actions_next)
            # Compute Q targets for current states
            q_targets = rewards + (self.gamma * q_targets_next * (1 - dones))
        # Compute critic loss
        q_expected = self.critic(states, actions)
        critic_loss = f.mse_loss(q_expected, q_targets)
        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        # torch.nn.utils.clip_grad_norm_(self.critic.parameters(), 1)
        self.critic_optimizer.step()

        # ---------------------------- update actor ---------------------------- #
        # Compute actor loss
        actions_pred = self.actor(states)  # + \
        # (torch.rand(*actions.shape, device=pytorch_device) * 0.1 - 0.05)
        # torch.clamp_(actions_pred, min=-1.0, max=1.0)
        actor_loss = -self.critic(states, actions_pred).mean()
        # Minimize the loss
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        # torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 1)
        self.actor_optimizer.step()

    def update_target_networks(self):
        soft_update(self.critic, self.critic_target, self.tau)
        soft_update(self.actor, self.actor_target, self.tau)

    def get_state_dicts(self):
        return {'actor_params': self.actor.state_dict(),
                'actor_optim_params': self.actor_optimizer.state_dict(),
                'critic_params': self.critic.state_dict(),
                'critic_optim_params': self.critic_optimizer.state_dict()}

    def load_state_dicts(self, state_dicts):
        self.actor.load_state_dict(state_dicts['actor_params'])
        self.actor_optimizer.load_state_dict(state_dicts['actor_optim_params'])
        self.critic.load_state_dict(state_dicts['critic_params'])
        self.critic_optimizer.load_state_dict(state_dicts['critic_optim_params'])


def soft_update(local_model, target_model, tau):
    """Soft update model parameters.
    θ_target = τ*θ_local + (1 - τ)*θ_target

    Params
    ======
        local_model: PyTorch model (weights will be copied from)
        target_model: PyTorch model (weights will be copied to)
        tau (float): interpolation parameter
    """
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)