Commit d13e8c0

Team C: wiki - Intro to RL (RoboticsKnowledgebase#88)
* Intro to RL * Update * Add the required sections
1 parent ad1606f commit d13e8c0

File tree

2 files changed: +122 -1 lines changed

_data/navigation.yml

Lines changed: 2 additions & 1 deletion
@@ -104,7 +104,8 @@ wiki:
   - title: Pure Pursuit Controller for Skid Steering
     url: /wiki/actuation/Pure-Pursuit-Controller-for-Skid-Steering-Robot.md
   - title: Machine Learning
-    url: /wiki/machine-learning/
+    url: /wiki/machine-learning/custom-semantic-data
+    url: /wiki/machine-learning/intro-to-rl
     children:
     - title: Custom data-set for segmentation
       url: /wiki/machine-learning/custom-semantic-data/

wiki/machine-learning/intro-to-rl.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
---
# Jekyll 'Front Matter' goes here. Most are set by default, and should NOT be
# overwritten except in special circumstances.
# You should set the date the article was last updated like this:
date: 2020-12-06 # YYYY-MM-DD
# This will be displayed at the bottom of the article
# You should set the article's title:
title: Introduction to Reinforcement Learning
# The 'title' is automatically displayed at the top of the page
# and used in other parts of the site.

---
The goal of Reinforcement Learning (RL) is to learn a good strategy for an agent from experimental trials and the relatively simple feedback it receives. With the optimal strategy, the agent can actively adapt to the environment to maximize future rewards.

## Key Concepts

### Bellman Equations

Bellman equations refer to a set of equations that decompose the value function into the immediate reward plus the discounted future values.
$$\begin{aligned}
V(s) &= \mathbb{E}[G_t | S_t = s]\\
&= \mathbb{E}[R_{t+1} + \gamma V(S_{t+1})|S_t = s]\\
Q(s,a)&=\mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s, A_t = a]
\end{aligned}$$

#### Bellman Expectation Equations

$$\begin{aligned}
V_{\pi}(s) &= \sum_a \pi(a|s)\sum_{s',r} p(s', r | s, a)[r + \gamma V_{\pi}(s')]\\
Q_\pi(s, a) &= \sum_{s'}\sum_{r}p(s', r | s, a)[r +\gamma\sum_{a'}\pi(a'|s')Q_\pi(s', a')]
\end{aligned}
$$

#### Bellman Optimality Equations

$$\begin{aligned}
V_*(s) &= \max_{a}\sum_{s'}\sum_{r}p(s', r | s, a)[r + \gamma V_*(s')]\\
Q_*(s,a) &= \sum_{s'}\sum_{r}p(s', r | s, a)[r +\gamma\max_{a'}Q_*(s', a')]
\end{aligned}$$
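
To make the optimality backup concrete, here is a minimal sketch (in Python) of repeatedly applying the $V_*$ equation on a hypothetical two-state, two-action MDP; the transition table `P`, the discount `gamma`, and the number of sweeps are assumptions made purely for illustration.

```python
# Minimal sketch: repeatedly apply the Bellman optimality backup
# V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]
# on a hypothetical 2-state, 2-action MDP (all numbers are made up).
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

V = {s: 0.0 for s in P}            # initialize V(s) = 0
for _ in range(1000):              # sweep until (approximately) converged
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }
print(V)  # approximate optimal state values
```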

## Approaches

### Dynamic Programming

When the model of the environment is known, we can follow the Bellman equations and use Dynamic Programming (DP) to iteratively evaluate value functions and improve the policy.

#### Policy Evaluation

$$
V_{t+1}(s) = \mathbb{E}[r+\gamma V_t(s') | S_t = s] = \sum_a\pi(a|s)\sum_{s', r}p(s', r|s,a)(r+\gamma V_t(s'))
$$
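
As a rough illustration, the following sketch performs this iterative evaluation for a uniform-random policy on a hypothetical two-state MDP; `P`, `pi`, `gamma`, and the stopping tolerance `tol` are all made-up names and values, not part of any particular library.

```python
# Minimal sketch of iterative policy evaluation:
# V_{t+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * V_t(s')]
# P[s][a]: list of (probability, next_state, reward); all numbers hypothetical.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
pi = {s: {a: 0.5 for a in P[s]} for s in P}   # uniform-random policy pi(a|s)
gamma, tol = 0.9, 1e-8

V = {s: 0.0 for s in P}
while True:
    V_new = {
        s: sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }
    if max(abs(V_new[s] - V[s]) for s in P) < tol:   # stop once values settle
        break
    V = V_new
print(V)  # V_pi for the uniform-random policy
```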

#### Policy Improvement

Given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action. It is a natural extension to consider changes at all states and to all possible actions, selecting at each state the action that appears best according to $Q_{\pi}(s,a)$. In other words, we make a new policy by acting greedily.

$$
Q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma V_\pi(S_{t+1}) | S_t = s, A_t = a] = \sum_{s', r} p(s', r|s, a)(r+\gamma V_\pi (s'))
$$
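
A minimal sketch of the greedy step, assuming we already have an action-value table `Q` (the numbers below are placeholders):

```python
# Minimal sketch: act greedily with respect to an action-value table.
# The Q values below are made up purely for illustration.
Q = {
    0: {0: 0.0, 1: 1.8},
    1: {0: 0.9, 1: 5.0},
}

# pi'(s) = argmax_a Q(s, a)
pi_greedy = {s: max(Q[s], key=Q[s].get) for s in Q}
print(pi_greedy)   # e.g. {0: 1, 1: 1}
```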

#### Policy Iteration

Once a policy, $\pi$, has been improved using $V_{\pi}$ to yield a better policy, $\pi'$, we can then compute $V_{\pi'}$ and improve it again to yield an even better $\pi''$. We can thus obtain a sequence of monotonically improving policies and value functions:
$$\pi_0 \xrightarrow{E}V_{\pi_0}\xrightarrow{I}\pi_1 \xrightarrow{E}V_{\pi_1}\xrightarrow{I}\pi_2 \xrightarrow{E}\dots\xrightarrow{I}\pi_*\xrightarrow{E}V_{\pi_*}$$
where $\xrightarrow{E}$ denotes a policy evaluation and $\xrightarrow{I}$ denotes a policy improvement.
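
Putting evaluation and improvement together, here is a minimal, self-contained sketch of policy iteration on the same kind of hypothetical MDP; again, `P`, `gamma`, the tolerance, and the helper `q_value` are assumptions made only for illustration.

```python
# Minimal sketch of policy iteration: alternate policy evaluation (E)
# and greedy policy improvement (I) until the policy stops changing.
# P[s][a]: list of (probability, next_state, reward); numbers are made up.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, tol = 0.9, 1e-8

def q_value(s, a, V):
    """One-step lookahead: Q(s,a) = sum_{s',r} p(s',r|s,a)[r + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

pi = {s: 0 for s in P}                       # arbitrary initial deterministic policy
while True:
    # E: evaluate the current policy
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: q_value(s, pi[s], V) for s in P}
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            break
        V = V_new
    # I: improve the policy greedily with one-step lookahead
    pi_new = {s: max(P[s], key=lambda a: q_value(s, a, V)) for s in P}
    if pi_new == pi:                         # stable policy: stop
        break
    pi = pi_new
print(pi, V)
```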

### Monte-Carlo Methods
Monte-Carlo (MC) methods require only experience --- sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. They learn from actual experience without any prior knowledge of the environment's dynamics. To compute the empirical return $G_t$, MC methods need complete episodes $S_1, A_1, R_2, \dots, S_T$ to compute $G_t = \sum_{k=0}^{T-t-1}\gamma^kR_{t+k+1}$, and all the episodes must eventually terminate no matter what actions are selected.

The empirical mean return for state $s$ is:
$$V(s)=\frac{\sum_{t=1}^T\mathbf{1}[S_t=s]G_t}{\sum_{t=1}^T\mathbf{1}[S_t = s]}$$
Each occurrence of state $s$ in an episode is called a visit to $s$. We may count every visit of state $s$, so that one state can contribute multiple visits within a single episode ("every-visit"), or only count the first time we encounter the state in an episode ("first-visit"). In practice, first-visit MC converges faster with lower average root mean squared error. An intuitive explanation is that it ignores data from visits to $s$ after the first, which breaks the correlation between samples and results in an unbiased estimate.

This way of approximation can easily be extended to action-value functions by counting $(s, a)$ pairs.
$$Q(s,a) = \frac{\sum_{t=1}^T\mathbf{1}[S_t = s, A_t = a]G_t}{\sum_{t=1}^T\mathbf{1}[S_t = s, A_t =a]}$$
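
As a rough sketch, first-visit MC estimation of $Q(s,a)$ from a batch of complete episodes might look like the following; the episodes, `gamma`, and all variable names are hypothetical placeholders, and in practice the episodes would come from rolling out the policy in the environment.

```python
from collections import defaultdict

# Minimal sketch of first-visit Monte-Carlo estimation of Q(s, a).
# Each episode is a list of (state, action, reward) triples; these
# particular episodes are made up purely for illustration.
episodes = [
    [(0, 1, 1.0), (1, 1, 2.0), (1, 0, 0.0)],
    [(0, 0, 0.0), (0, 1, 1.0), (1, 1, 2.0)],
]
gamma = 0.9

returns_sum = defaultdict(float)   # sum of first-visit returns per (s, a)
returns_cnt = defaultdict(int)     # number of first visits per (s, a)

for episode in episodes:
    first_visit = {}               # (s, a) -> index of its first occurrence
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    # walk backwards to accumulate the discounted return G_t for every step
    G, returns = 0.0, []
    for (s, a, r) in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # average only the return observed at the first visit of each (s, a)
    for (s, a), t in first_visit.items():
        returns_sum[(s, a)] += returns[t]
        returns_cnt[(s, a)] += 1

Q = {sa: returns_sum[sa] / returns_cnt[sa] for sa in returns_sum}
print(Q)
```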

To learn the optimal policy with MC, we iterate by following an idea similar to Generalized Policy Iteration (GPI); a code sketch follows the list below.

1. Improve the policy greedily with respect to the current value function: $$\pi(s) = \arg\max_{a\in A}Q(s,a)$$

2. Generate a new episode with the new policy $\pi$ (e.g. using algorithms like $\epsilon$-greedy to balance between exploitation and exploration).

3. Estimate $Q$ using the new episode: $$q_\pi(s, a) = \frac{\sum_{t = 1}^T(\mathbf{1}[S_t = s, A_t = a]\sum_{k = 0}^{T-t-1}\gamma^kR_{t+k+1})}{\sum_{t=1}^T\mathbf{1}[S_t = s, A_t = a]}$$
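
A minimal sketch of this loop with an $\epsilon$-greedy policy is given below; the tiny transition model `P`, the fixed episode horizon (used here instead of a true terminal state), and constants such as `epsilon` and `n_episodes` are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict

# Minimal sketch of Monte-Carlo control with an epsilon-greedy policy (GPI).
# The environment model, horizon, and constants below are made up.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma, epsilon, n_episodes, horizon = 0.9, 0.1, 5000, 10

def step(s, a):
    """Sample (next_state, reward) from the hypothetical model."""
    probs, outcomes = zip(*[(p, (s2, r)) for p, s2, r in P[s][a]])
    return random.choices(outcomes, weights=probs)[0]

Q = defaultdict(float)                       # Q[(s, a)], initialized to 0
counts = defaultdict(int)

def eps_greedy(s):
    """Random action with probability epsilon, otherwise greedy (step 1)."""
    if random.random() < epsilon:
        return random.choice(list(P[s]))
    return max(P[s], key=lambda a: Q[(s, a)])

for _ in range(n_episodes):
    # step 2: generate an episode with the current epsilon-greedy policy
    # (truncated at a fixed horizon, since this toy MDP never terminates)
    s, episode = 0, []
    for _ in range(horizon):
        a = eps_greedy(s)
        s2, r = step(s, a)
        episode.append((s, a, r))
        s = s2
    # step 3: first-visit update of Q toward the observed returns
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + gamma * G
        if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]   # running mean
print({sa: round(v, 2) for sa, v in Q.items()})
```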

### Temporal-Difference Learning

Temporal-difference (TD) learning is a combination of Monte-Carlo ideas and dynamic programming (DP) ideas. Like Monte-Carlo methods, TD methods can learn from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
Unlike Monte-Carlo methods, however, TD learning can learn from incomplete episodes. The one-step TD target for an action value is
$$Q(s, a) = R(s,a) + \gamma Q^\pi(s',a')$$
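
For instance, a single update toward this bootstrapped target (a SARSA-style update) might be sketched as follows; the step size `alpha` and the sample transition are made-up placeholders.

```python
# Minimal sketch of a single TD update toward the target r + gamma * Q(s', a').
# alpha and the sample transition (s, a, r, s2, a2) are made-up placeholders.
alpha, gamma = 0.1, 0.9
Q = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 1.0}

s, a, r, s2, a2 = 0, 1, 1.0, 1, 1          # one observed transition
td_target = r + gamma * Q[(s2, a2)]        # bootstrapped target
Q[(s, a)] += alpha * (td_target - Q[(s, a)])
print(Q[(0, 1)])
```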

#### Comparison between MC and TD

MC regresses $Q(s,a)$ toward targets $y = \sum_i r(s_i, a_i)$. Each rollout is random due to stochasticity in the policy and the environment, so to estimate $Q(s,a)$ we need to generate many trajectories and average over that stochasticity. This makes the estimate high-variance, but unbiased: the return is the true target.

TD estimates $Q(s,a)$ with $y = r(s,a)+\gamma Q^\pi(s',a')$, where $Q^\pi(s',a')$ already accounts for the stochasticity of future states and actions. The estimate therefore has lower variance, meaning it needs fewer samples to be accurate, but it is biased: if $Q(s', a')$ has approximation errors, the target $y$ has approximation errors, which can lead to unstable training due to error propagation.

#### Bootstrapping

Whereas MC methods update $V(S_t)$ toward the complete return $G_t$ (first equation below), TD learning methods update toward a target that involves an existing estimate, $R_{t+1} + \gamma V(S_{t+1})$ (second equation below), rather than relying exclusively on actual rewards and complete returns. This approach is known as bootstrapping.
$$
\begin{aligned}
V(S_t) &\leftarrow V(S_t) +\alpha[G_t - V(S_t)]\\
V(S_t) &\leftarrow V(S_t) +\alpha[R_{t+1} +\gamma V(S_{t+1}) - V(S_t)]
\end{aligned}
$$
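
A minimal sketch of the TD(0) update above on a stream of made-up transitions (the states, rewards, and step size `alpha` are placeholders):

```python
# Minimal sketch of TD(0):
# V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]
# The (state, reward, next_state) transitions below are made up for illustration.
alpha, gamma = 0.1, 0.9
V = {0: 0.0, 1: 0.0}

transitions = [(0, 1.0, 1), (1, 2.0, 1), (1, 0.0, 0), (0, 1.0, 1)]
for s, r, s2 in transitions:
    td_error = r + gamma * V[s2] - V[s]    # bootstrapped TD error
    V[s] += alpha * td_error               # update without waiting for episode end
print(V)
```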

## Summary

This article covers some of the fundamental methods used in Reinforcement Learning. Many more advanced algorithms exist, but they are not covered here due to limited space. Feel free to update this wiki to keep track of the latest RL algorithms.

## See Also:

<https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html>

## Further Reading

- Sutton, Richard S., and Andrew G. Barto. *Reinforcement Learning: An Introduction*. MIT Press.

## References

Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement learning: A survey." Journal of Artificial Intelligence Research 4 (1996): 237-285.
