|
12 | 12 | "\n",
|
13 | 13 | "- 1.了解强化学习的**特征**。也就是它区别于监督学习的地方,能解决什么类型的问题。\n",
|
14 | 14 | "- 2.了解强化学习的**常用解决方案**。针对不同的问题,解决方案主要分成两部分:\n",
|
15 |    | - "    - The first part is **Tabular Solution Methods**, which target discrete, simple problems;\n",
16 |    | - "    - The second part is **Approximate Solution Methods**, which target continuous, complex problems.\n",
   | 15 | + "    - The first part is **Tabular Solution Methods**, which target **discrete, simple** problems;\n",
   | 16 | + "    - The second part is **Approximate Solution Methods**, which target **continuous, complex** problems.\n",
17 | 17 | "\n",
|
18 | 18 | "\n",
|
19 | 19 | "## To be more specific\n",
|
20 | 20 | "\n",
|
21 |    | - "- 1. Understand the main elements of reinforcement learning: `policy`, `reward function`, `value function`, `model (of the environment)`, and tell apart `reward`, `value function` and `q function`.\n",
   | 21 | + "### 1. Understand the main elements of reinforcement learning\n",
   | 22 | + "`policy`, `reward function`, `value function`, `model (of the environment)`; be clear about the difference between `reward`, `value function` and `q function`.\n",
22 | 23 | "\n",
|
23 |    | - "- 2. Understand the concept of an MDP. An MDP is a model of the environment and covers the vast majority of reinforcement learning problems.\n",
24 |    | - "    - `Bellman Expectation Equation`: $v_{\\pi}(s) = \\sum_a\\pi(a|s)\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_{\\pi}(s')]\\;\\;\\forall s \\in S$\n",
25 |    | - "    - `Bellman Optimality Equation`: $v_*(s)=\\underset{a\\in A(s)}{max}\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_*(s')]$ and $q_*(s,a)=\\sum_{s',r}p(s',r|s,a)[r+\\gamma \\underset{a'}{max}q_*(s', a')]$\n",
26 |    | - "    - Both are essentially recursive formulas; the **“backup”** idea behind them lays the groundwork for dynamic programming.\n",
   | 24 | + "### 2. Understand the concept of an MDP\n",
   | 25 | + "An MDP is a model of the environment and covers the vast majority of reinforcement learning problems.\n",
   | 26 | + "- `Bellman Expectation Equation`: $v_{\\pi}(s) = \\sum_a\\pi(a|s)\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_{\\pi}(s')]\\;\\;\\forall s \\in S$\n",
   | 27 | + "- `Bellman Optimality Equation`: $v_*(s)=\\underset{a\\in A(s)}{max}\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_*(s')]$ and $q_*(s,a)=\\sum_{s',r}p(s',r|s,a)[r+\\gamma \\underset{a'}{max}q_*(s', a')]$\n",
   | 28 | + "- Both are essentially recursive formulas; the **“backup”** idea behind them means inferring a state's value from the value of its successor state (a small numerical sketch follows below).\n",
27 | 29 | "\n",
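Reading the Bellman expectation equation as a recipe, each state's value is one weighted average over actions and successor states, so $v_\pi$ can be obtained by applying that backup repeatedly. Below is a minimal numerical sketch of this idea; the two-state transition table, rewards and policy are invented purely for illustration and are not part of the notebook.

```python
# Repeated Bellman expectation backups on a tiny, made-up 2-state MDP.
# p[s][a] is a list of (prob, next_state, reward) triples.
p = {
    0: {0: [(1.0, 1, 1.0)],                      # action 0: move to state 1, reward +1
        1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]},      # action 1: 50/50 stay or finish with +2
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 0.0)]}, # state 1 just loops with zero reward
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}  # assumed equiprobable policy in state 0
gamma = 0.9
v = {0: 0.0, 1: 0.0}

def backup(v):
    """v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]."""
    return {s: sum(pi[s][a] * sum(prob * (r + gamma * v[s2])
                                  for prob, s2, r in outcomes)
                   for a, outcomes in p[s].items())
            for s in p}

for _ in range(50):   # repeated backups converge to v_pi
    v = backup(v)
print(v)              # state 1 stays at 0, state 0 settles around 1.29
```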
|
28 |    | - "- 3. With the concept of an MDP in place, next consider how to solve MDP problems.\n",
29 |    | - "    - `planning` means evaluating $v_{\\pi}$,\n",
30 |    | - "    - `controlling` means using that evaluation to find the optimal policy $\\pi_*$.\n",
31 |    | - "    - `policy iteration`: each iteration first evaluates the policy to obtain a new $v$, then improves the policy based on that $v$. $v_{\\pi}(s) = \\sum_a\\pi(a|s)\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_{\\pi}(s')]\\;\\;\\forall s \\in S$\n",
32 |    | - "    - `value iteration`: each iteration directly computes the best $v$ for that round; once the iterations finish, the corresponding optimal policy is extracted. $v_{k+1}(s)=\\underset{a}{max} E[R_{t+1}+\\gamma v_k(S_{t+1})|S_t=s, A_t=a]\\\\=\\underset{a}{max}\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_k(s')]$\n",
   | 30 | + "### 3. With the concept of an MDP in place, next consider how to solve MDP problems\n",
   | 31 | + "The three basic methods for solving MDP problems:\n",
   | 32 | + "- Dynamic Programming: theoretically clean, but it demands heavy computation and an accurate model of the environment.\n",
   | 33 | + "- Monte Carlo Methods: model-free and easy to apply, but not well suited to step-by-step incremental computation (a small prediction sketch follows below).\n",
   | 34 | + "- Temporal-Difference Learning: model-free and fully incremental, but harder to analyze.\n",
33 | 35 | "\n",
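To make the Monte Carlo entry in that comparison concrete, here is a minimal first-visit Monte Carlo prediction sketch: it needs no model, only complete episodes, and estimates $V(s)$ as the average observed return. The episode data and the (state, reward) pairing convention are assumptions made up for illustration.

```python
from collections import defaultdict

# First-visit Monte Carlo prediction: V(s) <- average(Return(s)).
# Each episode is a list of (state, reward received on leaving that state) pairs.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("A", 0.0), ("B", 1.0)],
]
gamma = 0.9
returns = defaultdict(list)

for episode in episodes:
    G = 0.0
    # Walk the episode backwards, accumulating the discounted return G.
    for t in reversed(range(len(episode))):
        state, reward = episode[t]
        G = gamma * G + reward
        # First-visit: record G only if the state does not occur earlier in the episode.
        if all(s != state for s, _ in episode[:t]):
            returns[state].append(G)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V)   # roughly {'B': 1.0, 'A': 0.855} for this made-up data
```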
|
34 |    | - "- 4. Next, consider the model-free setting, i.e. no state-transition probabilities are available.\n",
35 |    | - "    - `planning`:\n",
36 |    | - "      - `MC (Monte-Carlo)`: generate a complete episode each time, then use the average return $G$ observed for a state $s$ as the estimate of $v$. $V(s)\\leftarrow average(Return(s))$\n",
37 |    | - "      - `TD (Temporal Difference)`: $v$ can be updated after every single decision: $V(S)\\leftarrow V(S)+\\alpha[R+\\gamma V(S')-V(S)]$\n",
38 |    | - "    - `controlling`:\n",
39 |    | - "      - `On-Policy Monte-Carlo Control`: `MC planning` + `greedy policy`\n",
40 |    | - "      - `On-Policy Sarsa Control`: `TD planning` + `greedy policy`, but the update targets $q$ instead of $v$, hence the name `SARSA`: $Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma Q(S',A')-Q(S,A)]$\n",
41 |    | - "      - Off-Policy: while generating episodes, a separate (typically more exploratory) policy guides the generation rather than the final `target policy`; the guiding policy is called the `behavior policy`.\n",
   | 36 | + "The three basic methods can also be combined with one another:\n",
   | 37 | + "- Monte Carlo + temporal-difference learning: multi-step bootstrapping.\n",
   | 38 | + "- Temporal-difference learning + model learning.\n",
   | 39 | + "\n",
   | 40 | + "- `planning` means evaluating $v_{\\pi}$.\n",
   | 41 | + "- `controlling` means using that evaluation to find the optimal policy $\\pi_*$.\n",
   | 42 | + "- `policy iteration`: each iteration first evaluates the policy to obtain a new $v$, then improves the policy based on that $v$.\n",
   | 43 | + "  - $v_{\\pi}(s) = \\sum_a\\pi(a|s)\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_{\\pi}(s')]\\;\\;\\forall s \\in S$\n",
   | 44 | + "- `value iteration`: each iteration directly computes the best $v$ for that round; once the iterations finish, the corresponding optimal policy is extracted (a sketch follows below).\n",
   | 45 | + "  - $v_{k+1}(s)=\\underset{a}{max} E[R_{t+1}+\\gamma v_k(S_{t+1})|S_t=s, A_t=a]\\\\=\\underset{a}{max}\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_k(s')]$\n",
   | 46 | + "\n",
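As a quick check of the value-iteration update above, the sketch below sweeps $v_{k+1}(s)=\max_a\sum_{s',r}p(s',r|s,a)[r+\gamma v_k(s')]$ over a tiny invented transition table and then reads off the greedy policy. The MDP itself is an assumption for illustration only.

```python
# Minimal value iteration on a toy MDP.  p[s][a] = list of (prob, next_state, reward).
p = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 5.0)]},
    1: {0: [(1.0, 2, 10.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},   # absorbing terminal-like state
}
gamma = 0.9
v = {s: 0.0 for s in p}

def q(s, a, v):
    """Expected one-step return: sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]."""
    return sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[s][a])

# v_{k+1}(s) = max_a q(s, a, v_k)
for _ in range(100):
    v = {s: max(q(s, a, v) for a in p[s]) for s in p}

# Read off the greedy (optimal) policy once the values have converged.
policy = {s: max(p[s], key=lambda a: q(s, a, v)) for s in p}
print(v, policy)   # v ~ {0: 9.0, 1: 10.0, 2: 0.0}, policy {0: 0, 1: 0, 2: 0}
```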
   | 47 | + "### 4. Next, consider the model-free setting, i.e. no state-transition probabilities\n",
   | 48 | + "- `planning`:\n",
   | 49 | + "  - `MC (Monte-Carlo)`: generate a complete episode each time, then use the average return $G$ observed for a state $s$ as the estimate of $v$. $V(s)\\leftarrow average(Return(s))$\n",
   | 50 | + "  - `TD (Temporal Difference)`: $v$ can be updated after every single decision: $V(S)\\leftarrow V(S)+\\alpha[R+\\gamma V(S')-V(S)]$\n",
   | 51 | + "- `controlling`:\n",
   | 52 | + "  - `On-Policy Monte-Carlo Control`: `MC planning` + `greedy policy`\n",
   | 53 | + "  - `On-Policy Sarsa Control`: `TD planning` + `greedy policy`, but the update targets $q$ instead of $v$, hence the name `SARSA`: $Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma Q(S',A')-Q(S,A)]$\n",
   | 54 | + "  - Off-Policy: while generating episodes, a separate (typically more exploratory) policy guides the generation rather than the final `target policy`; the guiding policy is called the `behavior policy`.\n",
42 | 55 |     "    - `Off-Policy Monte-Carlo Control`: compared with `On-Policy Monte-Carlo Control`, adds importance sampling.\n",
|
43 | 56 |     "    - `Off-Policy Q-learning`: compared with `Sarsa`, the bootstrap target uses the greedy action: $Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma \\underset{a'}{max}Q(S',a')-Q(S,A)]$ (a sketch of both updates follows below).\n",
|
44 | 57 | "\n",
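The Sarsa and Q-learning updates above differ only in the bootstrap target: Sarsa uses the action actually taken next ($Q(S',A')$), Q-learning uses the greedy action ($\max_{a'}Q(S',a')$), which is what makes it off-policy. Here is a minimal tabular sketch of both update rules with an $\epsilon$-greedy behaviour policy; the states, actions and rewards are placeholders invented for illustration.

```python
import random
from collections import defaultdict

# Tabular Q-values keyed by (state, action); states and actions here are placeholders.
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.9, 0.1
ACTIONS = [0, 1]

def epsilon_greedy(state):
    """Behaviour policy used to pick actions while learning."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s2, a2):
    # On-policy: bootstrap on the action a2 that will actually be taken in s2.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(s, a, r, s2):
    # Off-policy: bootstrap on the greedy action in s2, regardless of what is executed.
    best = max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

# Two made-up transitions, one per rule, just to show the update targets.
a_next = epsilon_greedy("y")
sarsa_update("x", 0, 1.0, "y", a_next)    # Sarsa step on transition x -> y
q_learning_update("u", 1, -1.0, "w")      # Q-learning step on transition u -> w
print(dict(Q))
```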
|
45 |    | - "- 5. Function approximation:\n",
46 |    | - "    - Approximating the value function: the objective $J(w) = E_{\\pi}[(v_{\\pi}(S)-\\hat v(S,w))^2]$ drives the approximate value function towards the true one.\n",
47 |    | - "      - `Q-Learning with Linear Function Approximation`\n",
48 |    | - "      - `Deep Q-Learning (DQN)`: uses `Experience Replay` and a `fixed Q-learning target`.\n",
49 |    | - "    - Fitting the policy function: the objective $J_1(\\theta)=V^{\\pi_{\\theta}}(s_1) = E_{\\pi_{\\theta}}[v_1]$ looks for a policy that maximizes the value function.\n",
50 |    | - "      - `Monte-Carlo Policy Gradient (REINFORCE)`\n",
51 |    | - "    - Approximate value function + fitted policy function\n",
52 |    | - "      - `Actor-Critic`: the critic updates the value-function parameters $w$; the actor updates the policy parameters $\\theta$ in the direction suggested by the critic.\n",
   | 58 | + "### 5. Function approximation\n",
   | 59 | + "- Approximating the value function: the objective $J(w) = E_{\\pi}[(v_{\\pi}(S)-\\hat v(S,w))^2]$ drives the approximate value function towards the true one (a linear sketch follows below).\n",
   | 60 | + "  - `Q-Learning with Linear Function Approximation`\n",
   | 61 | + "  - `Deep Q-Learning (DQN)`: uses `Experience Replay` and a `fixed Q-learning target`.\n",
   | 62 | + "- Fitting the policy function: the objective $J_1(\\theta)=V^{\\pi_{\\theta}}(s_1) = E_{\\pi_{\\theta}}[v_1]$ looks for a policy that maximizes the value function.\n",
   | 63 | + "  - `Monte-Carlo Policy Gradient (REINFORCE)`\n",
   | 64 | + "- Approximate value function + fitted policy function\n",
   | 65 | + "  - `Actor-Critic`: the critic updates the value-function parameters $w$; the actor updates the policy parameters $\\theta$ in the direction suggested by the critic.\n",
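For the value-function-approximation objective $J(w)$ above, the simplest concrete case is a linear approximator $\hat v(S,w)=w^\top x(S)$ trained with a semi-gradient TD(0) step. The feature map and the sample transition below are made-up assumptions, sketching the update rather than reproducing any implementation from the notebook; DQN builds on the same idea with a neural network in place of the linear features.

```python
import numpy as np

# Linear value-function approximation: v_hat(s, w) = w . x(s).
def x(state):
    """Tiny hand-made feature vector for a scalar state (an assumption for illustration)."""
    return np.array([1.0, state, state ** 2])

w = np.zeros(3)
alpha, gamma = 0.1, 0.9

def v_hat(state, w):
    return w @ x(state)

def semi_gradient_td0(s, r, s2, w):
    # TD error uses the bootstrapped target r + gamma * v_hat(s'),
    # but the gradient is taken only through v_hat(s) -- hence "semi-gradient".
    delta = r + gamma * v_hat(s2, w) - v_hat(s, w)
    return w + alpha * delta * x(s)

# One made-up transition: state 0.5, reward 1.0, next state 0.7.
w = semi_gradient_td0(s=0.5, r=1.0, s2=0.7, w=w)
print(w, v_hat(0.5, w))
```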
53 | 66 | ]
|
54 | 67 | },
|
55 | 68 | {
|
|
64 | 77 | ],
|
65 | 78 | "metadata": {
|
66 | 79 | "kernelspec": {
|
67 |    | - "display_name": "Python 2",
   | 80 | + "display_name": "Python 3",
68 | 81 | "language": "python",
|
69 |    | - "name": "python2"
   | 82 | + "name": "python3"
70 | 83 | },
|
71 | 84 | "language_info": {
|
72 | 85 | "codemirror_mode": {
|
73 | 86 | "name": "ipython",
|
74 |    | - "version": 2
   | 87 | + "version": 3
75 | 88 | },
|
76 | 89 | "file_extension": ".py",
|
77 | 90 | "mimetype": "text/x-python",
|
78 | 91 | "name": "python",
|
79 | 92 | "nbconvert_exporter": "python",
|
80 |    | - "pygments_lexer": "ipython2",
81 |    | - "version": "2.7.14"
   | 93 | + "pygments_lexer": "ipython3",
   | 94 | + "version": "3.6.4"
82 | 95 | }
|
83 | 96 | },
|
84 | 97 | "nbformat": 4,
|
|