
Commit ceef522

Kaixhin authored and apaszke committed
Expanded DQN tutorial (pytorch#4)
1 parent a2071de commit ceef522

File tree

1 file changed: +33 -23 lines


Reinforcement (Q-)Learning with PyTorch.ipynb

+33 -23
@@ -6,17 +6,19 @@
 "source": [
 "# PyTorch DQN tutorial\n",
 "\n",
-"This tutorial shows how to use pytorch to train an DQN agent on a CartPole-v0 task from Open AI gym.\n",
+"This tutorial shows how to use PyTorch to train a DQN agent on the CartPole-v0 task from the [OpenAI Gym](https://gym.openai.com/).\n",
 "\n",
 "### Task\n",
 "\n",
-"The agent has to decide to move the cart left or right, so that the pole attached to it stays upright. You can find an official board with various algorithms and visualizations [at the AI Gym website](https://gym.openai.com/envs/CartPole-v0).\n",
+"The agent has to decide between two actions - moving the cart left or right - so that the pole attached to it stays upright. You can find an official leaderboard with various algorithms and visualizations at the [Gym website](https://gym.openai.com/envs/CartPole-v0).\n",
 "\n",
 "![cartpole](images/cartpole.gif)\n",
 "\n",
-"This task is designed so that the input are 4 real values representing the environment state (accelerations, etc.). However this seems a bit boring, so we'll use a screen patch centered on the cart as an input. Because of this, our results aren't directly comparabe to the ones from an official leaderboard - our task is harder.\n",
+"As the agent observes the current state of the environment and chooses an action, the environment *transitions* to a new state, and also returns a reward that indicates the consequences of the action. In this task, the environment terminates if the pole falls over too far.\n",
 "\n",
-"This unfortunately slows down the training, because we have to render all the frames."
+"The CartPole task is designed so that the inputs to the agent are 4 real values representing the environment state (position, velocity, etc.). However, neural networks can solve the task purely by looking at the scene, so we'll use a patch of the screen centered on the cart as an input. Because of this, our results aren't directly comparable to the ones from the official leaderboard - our task is much harder. Unfortunately this does slow down the training, because we have to render all the frames.\n",
+"\n",
+"Strictly speaking, we will present the state as the difference between the current screen patch and the previous one. This will allow the agent to take the velocity of the pole into account from one image."
 ]
 },
 {
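This commit only touches the notebook's markdown cells, so the screen-difference state it describes does not appear in the diff. A minimal, self-contained sketch of the idea, with random tensors standing in for real rendered screen patches:

```python
import torch

# Stand-in screen patches; in the notebook these come from rendering the
# environment and cropping a patch around the cart (channels x height x width).
last_screen = torch.rand(3, 40, 80)
current_screen = torch.rand(3, 40, 80)

# The state fed to the agent is the difference between consecutive patches,
# so a single input implicitly encodes the pole's velocity.
state = current_screen - last_screen
```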
@@ -25,12 +27,12 @@
 "source": [
 "### Packages\n",
 "\n",
-"First, let's import needed packages. From PyTorch, we'll use:\n",
+"First, let's import the needed packages. We need [`gym`](https://gym.openai.com/docs) for the environment. We'll also use the following from PyTorch:\n",
 "\n",
-"* neural network package (`torch.nn`)\n",
-"* optimization package (`torch.optim`)\n",
-"* automatic differentiation package (`torch.autograd`)\n",
-"* package with utilities for vision tasks (`torch_vision`)."
+"* neural networks (`torch.nn`)\n",
+"* optimization (`torch.optim`)\n",
+"* automatic differentiation (`torch.autograd`)\n",
+"* utilities for vision tasks (`torchvision` - [a separate package](https://github.com/pytorch/vision))."
 ]
 },
 {
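The import cell itself is unchanged by this commit and not shown in the diff. Roughly, the packages listed above would be pulled in like this (the aliases are common conventions, not something this diff specifies):

```python
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable
import torchvision.transforms as T
```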
@@ -76,9 +78,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Replay memory\n",
+"### Replay Memory\n",
 "\n",
-"We'll be using experience replay for training our DQN. It allows us to reuse the data we observed earlier and sample from it randomly, so the transitions that build up a batch are decorelated. It has been shown that this greately stabilizes and improves the DQN training procedure.\n",
+"We'll be using experience replay memory for training our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.\n",
 "\n",
 "For this, we're going to need two classes:\n",
 "\n",
@@ -125,26 +127,34 @@
 "\n",
 "### DQN algorithm\n",
 "\n",
-"Our world is deterministic, so all equations presented here are also assuming determinism of the process.\n",
+"Our environment is deterministic, so all equations presented here are also formulated deterministically for the sake of simplicity. In the reinforcement learning literature, they would also contain expectations over stochastic transitions in the environment.\n",
 "\n",
-"Our aim will be to train a policy that tries to maximize the discounted reward $R_{t_0} = \\sum_{t=t_0}^{\\infty} r_t \\gamma^{t - t_0}$. $\\gamma$ should be a constant between $0$ and $1$ that ensures the sum converges, and makes the rewards from the uncertain far future be less important for our agent than the ones it can be fairly confident about.\n",
+"Our aim will be to train a policy that tries to maximize the discounted, cumulative reward $R_{t_0} = \\sum_{t=t_0}^{\\infty} \\gamma^{t - t_0} r_t$, where $R_{t_0}$ is also known as the *return*. The discount, $\\gamma$, should be a constant between $0$ and $1$ that ensures the sum converges. It makes rewards from the uncertain far future less important for our agent than the ones in the near future that it can be fairly confident about.\n",
 "\n",
-"The main idea behind Q-learning is that if we had a function $Q^*: State \\times Action \\rightarrow \\mathbb{R}$, that could tell us what would our discounted reward be, if we were to take an action in a given state, we could easily construct a policy that miximizes our rewards:\n",
+"The main idea behind Q-learning is that if we had a function $Q^*: State \\times Action \\rightarrow \\mathbb{R}$ that could tell us what our return would be if we were to take an action in a given state, then we could easily construct a policy that maximizes our rewards:\n",
 "\n",
-"$$\\pi^*(s) = \\mathrm{argmax}_a \\ Q^*(s, a)$$\n",
+"$$\\pi^*(s) = \\arg\\!\\max_a \\ Q^*(s, a)$$\n",
 "\n",
-"However, we don't know everything about the world, so we don't have access to $Q^*$, but since neural networks are universal function approximators, we can simply create one and train it to resemble the $Q^*$.\n",
+"However, we don't know everything about the world, so we don't have access to $Q^*$. But, since neural networks are universal function approximators, we can simply create one and train it to resemble $Q^*$.\n",
 "\n",
-"For our training update rule, we'll use a fact that every $Q$ function for some policy obeys the Bellman equation.\n",
+"For our training update rule, we'll use the fact that every $Q$ function for some policy obeys the Bellman equation:\n",
 "\n",
 "$$Q^{\\pi}(s, a) = r + \\gamma Q^{\\pi}(s', \\pi(s'))$$\n",
 "\n",
-"Our loss will be a mean squared error between the two sides of the equality (where $B$ is a batch of transitions):\n",
-"$$L = \\frac{1}{|B|}\\sum_{(s, a, s', r) \\ \\in \\ B} (Q(s, a) - (r + \\gamma \\max_a Q(s', a)))^2$$\n",
+"The difference between the two sides of the equality is known as the temporal difference error, $\\delta$:\n",
+"\n",
+"$$\\delta = Q(s, a) - (r + \\gamma \\max_a Q(s', a))$$\n",
+"\n",
+"To minimize this error, we will use the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss). The Huber loss acts like the mean squared error when the error is small, but like the mean absolute error when the error is large - this makes it more robust to outliers when the estimates of $Q$ are very noisy. We calculate this over a batch of transitions, $B$, sampled from the replay memory:\n",
+"\n",
+"$$\\mathcal{L} = \\frac{1}{|B|}\\sum_{(s, a, s', r) \\ \\in \\ B} \\mathcal{L}(\\delta) \\quad \\text{where} \\quad \\mathcal{L}(\\delta) = \\begin{cases}\n",
+"  \\frac{1}{2}{\\delta^2} & \\text{for } |\\delta| \\le 1, \\\\\n",
+"  |\\delta| - \\frac{1}{2} & \\text{otherwise.}\n",
+"\\end{cases}$$\n",
 "\n",
 "### Q-network\n",
 "\n",
-"Our model will be a CNN that takes in a difference between the current screen patch, and the previous one. This will allow it to take the velocity of the pole into account. It has two outputs representing $Q(s, \\mathrm{left})$ and $Q(s, \\mathrm{right})$ (where $s$ is the input to the network)."
+"Our model will be a convolutional neural network that takes in the difference between the current and previous screen patches. It has two outputs, representing $Q(s, \\mathrm{left})$ and $Q(s, \\mathrm{right})$ (where $s$ is the input to the network). In effect, the network is trying to predict the *quality* of taking each action given the current input."
 ]
 },
 {
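The network described above and the Huber loss applied to $\delta$ live in code cells that this commit leaves untouched. A rough, self-contained sketch of such a two-output Q-network follows; the layer sizes are assumptions matching a 3x40x80 screen-difference input, and `F.smooth_l1_loss` is PyTorch's built-in form of the piecewise Huber loss given above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DQN(nn.Module):
    """Convolutional Q-network with one output per action: Q(s, left), Q(s, right)."""

    def __init__(self):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)
        self.head = nn.Linear(448, 2)  # 448 = 32 * 2 * 7 for a 3x40x80 input

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0), -1))


# Example forward pass on a random batch of screen differences.
net = DQN()
q_values = net(torch.rand(2, 3, 40, 80))   # shape: (2, 2)

# The Huber loss: quadratic for small errors, linear for large ones. Here the
# targets are random stand-ins; in the notebook they are r + gamma * max_a Q(s', a).
targets = torch.rand(2, 2)
loss = F.smooth_l1_loss(q_values, targets)
```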
@@ -179,7 +189,7 @@
 "source": [
 "### Input extraction\n",
 "\n",
-"The code below are utilities for extracting and processing rendered images from the env. It uses the `torch_vision` package, that makes it easy to compose image transforms. Once you run the cell it will display an example patch it extracted."
+"The code below provides utilities for extracting and processing rendered images from the environment. It uses the `torchvision` package, which makes it easy to compose image transforms. Once you run the cell it will display an example patch that it extracted."
 ]
 },
 {
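The extraction cell is also outside this diff. As a small sketch of the kind of `torchvision` transform pipeline the text refers to (the sizes and the stand-in frame are assumptions, not the notebook's exact values):

```python
import numpy as np
import torchvision.transforms as T

# Turn a raw rendered frame into a small float tensor patch:
# numpy HxWx3 uint8 -> PIL image -> downscaled -> CxHxW tensor in [0, 1].
resize = T.Compose([T.ToPILImage(),
                    T.Resize(40),
                    T.ToTensor()])

frame = np.random.randint(0, 255, (400, 600, 3), dtype=np.uint8)  # stand-in for env.render()
patch = resize(frame)                                             # shape: (3, 40, 60)
```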
@@ -303,9 +313,9 @@
 "\n",
 "Finally, the code for training our model.\n",
 "\n",
-"At the top you can find an `optimize_model` function that performs a single step of the optimization. It first samples a batch, concatenates all the tensors into a single one, computes $Q(s_t, a_t)$ and $V(s_{t+1}) = \\max_a Q(s_{t+1}, a)$, and combines them into our loss. There's some complication because of the final states, for which $V(s) = 0$.\n",
+"At the top you can find an `optimize_model` function that performs a single step of the optimization. It first samples a batch, concatenates all the tensors into a single one, computes $Q(s_t, a_t)$ and $V(s_{t+1}) = \\max_a Q(s_{t+1}, a)$, and combines them into our loss. By definition we set $V(s) = 0$ if $s$ is a terminal state.\n",
 "\n",
-"Below, you can find the main training loop. At the beginning we reset the env and initialize the `state` variable. Then, we sample an action, execute it, observe the next screen and the reward (always 1), and optimize our model once. When the episode ends (our model fails), we restart the loop."
+"Below, you can find the main training loop. At the beginning we reset the environment and initialize the `state` variable. Then, we sample an action, execute it, observe the next screen and the reward (always 1), and optimize our model once. When the episode ends (our model fails), we restart the loop."
 ]
 },
 {
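`optimize_model` and the training loop are not part of this commit either. The one subtle point called out above - setting $V(s) = 0$ for terminal states - can be illustrated with a self-contained sketch; the numbers are arbitrary stand-ins for what the network would predict:

```python
import torch

GAMMA = 0.999  # discount factor (illustrative value)

# A stand-in batch of 4 transitions sampled from the replay memory.
rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])              # CartPole's reward is always 1
non_final_mask = torch.tensor([True, True, False, True])  # False marks a terminal next state
max_next_q = torch.tensor([0.52, 0.61, 0.47])             # max_a Q(s', a) for the non-final s'

# V(s') is max_a Q(s', a) for non-terminal states and, by definition, 0 for terminal ones.
next_state_values = torch.zeros(4)
next_state_values[non_final_mask] = max_next_q

expected_q = rewards + GAMMA * next_state_values          # Bellman targets for the Huber loss
```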
