
Commit d081399

Author: cer (committed)

Merge branch 'master' of ssh://10.84.140.3:8022/nfs/project/rl_learn

2 parents 023d011 + acf90f1, commit d081399

File tree: 6 files changed, +268 -104 lines changed

1_gridworld.ipynb

Lines changed: 3 additions & 3 deletions
@@ -515,9 +515,9 @@
 "metadata": {
 "anaconda-cloud": {},
 "kernelspec": {
-"display_name": "Python [conda env:py35]",
+"display_name": "Python 3",
 "language": "python",
-"name": "conda-env-py35-py"
+"name": "python3"
 },
 "language_info": {
 "codemirror_mode": {
@@ -529,7 +529,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.5.4"
+"version": "3.6.4"
 }
 },
 "nbformat": 4,

README.md

Lines changed: 6 additions & 6 deletions
@@ -18,12 +18,12 @@
 
 All of the experiment source code is in the `lib` directory and comes from [dennybritz](https://github.com/dennybritz/reinforcement-learning); this repository only provides commentary and a summary.
 
-- [Gridworld](https://github.com/applenob/rl_learn/blob/master/1_gridworld.ipynb): **Dynamic Programming** for MDPs.
-- [Blackjack](https://github.com/applenob/rl_learn/blob/master/2_blackjack.ipynb): model-free **Monte Carlo** planning and controlling.
-- [Windy Gridworld](https://github.com/applenob/rl_learn/blob/master/3_windy_gridworld.ipynb): model-free **Temporal Difference** on-policy controlling, i.e. **SARSA**.
-- [Cliff Walking](https://github.com/applenob/rl_learn/blob/master/4_cliff_walking.ipynb): model-free Temporal Difference off-policy controlling, i.e. Q-learning.
-- [Mountain Car](https://github.com/applenob/rl_learn/blob/master/5_mountain_car.ipynb): Q-Learning with Linear Function Approximation, for when the Q table is too large to handle (continuous state space).
-- [Atari](https://github.com/applenob/rl_learn/blob/master/6_atari.ipynb): Deep-Q Learning.
+- [Gridworld](https://github.com/applenob/rl_learn/blob/master/1_gridworld.ipynb): **Dynamic Programming** for **MDP**s.
+- [Blackjack](https://github.com/applenob/rl_learn/blob/master/2_blackjack.ipynb): **model-free** **Monte Carlo** planning and controlling.
+- [Windy Gridworld](https://github.com/applenob/rl_learn/blob/master/3_windy_gridworld.ipynb): **model-free** **Temporal Difference** **on-policy controlling**, i.e. **SARSA**.
+- [Cliff Walking](https://github.com/applenob/rl_learn/blob/master/4_cliff_walking.ipynb): **model-free** **Temporal Difference** **off-policy controlling**, i.e. **Q-learning**.
+- [Mountain Car](https://github.com/applenob/rl_learn/blob/master/5_mountain_car.ipynb): **Q-Learning with Linear Function Approximation**, for when the Q table is too large to handle (continuous state space).
+- [Atari](https://github.com/applenob/rl_learn/blob/master/6_atari.ipynb): **Deep-Q Learning**.
 
 ## Other important learning resources:
 

book/bookdraft2018.pdf

1.46 MB
Binary file not shown.

learning_route.ipynb

Lines changed: 46 additions & 33 deletions
@@ -12,44 +12,57 @@
 "\n",
 "- 1. Understand the **characteristics** of reinforcement learning, i.e. what distinguishes it from supervised learning and what kinds of problems it can solve.\n",
 "- 2. Understand the **common solution methods** of reinforcement learning. Depending on the problem, they fall into two parts:\n",
-"    - The first part, **tabular solution methods (Tabular Solution Method)**, targets discrete, simple problems.\n",
-"    - The second part, **approximate solution methods (Approximate Solution Methods)**, targets continuous, complex problems.\n",
+"    - The first part, **tabular solution methods (Tabular Solution Method)**, targets **discrete, simple** problems.\n",
+"    - The second part, **approximate solution methods (Approximate Solution Methods)**, targets **continuous, complex** problems.\n",
 "\n",
 "\n",
 "## More concretely\n",
 "\n",
-"- 1. Understand the main elements of reinforcement learning: `policy`, `reward function`, `value function`, `model (of the environment)`, and distinguish between `reward`, `value function` and `q function`.\n",
+"### 1. Understand the main elements of reinforcement learning\n",
+"`policy`, `reward function`, `value function`, `model (of the environment)`; distinguish between `reward`, `value function` and `q function`.\n",
 "\n",
-"- 2. Understand the concept of an MDP. An MDP is a model of the environment and covers the vast majority of reinforcement learning problems.\n",
-"    - `Bellman Expectation Equation`: $v_{\\pi}(s) = \\sum_a\\pi(a|s)\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_{\\pi}(s')]\\;\\;\\forall s \\in S$\n",
-"    - `Bellman Optimality Equation`: $v_*(s)=\\underset{a\\in A(s)}{max}\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_*(s')]$ and $q_*(s,a)=\\sum_{s',r}p(s',r|s,a)[r+\\gamma \\underset{a'}{max}q_*(s', a')]$\n",
-"    - Both are essentially recursive formulas; the **\"backup\"** idea they embody lays the foundation for dynamic programming.\n",
+"### 2. Understand the concept of an MDP\n",
+"An MDP is a model of the environment and covers the vast majority of reinforcement learning problems.\n",
+"- `Bellman Expectation Equation`: $v_{\\pi}(s) = \\sum_a\\pi(a|s)\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_{\\pi}(s')]\\;\\;\\forall s \\in S$\n",
+"- `Bellman Optimality Equation`: $v_*(s)=\\underset{a\\in A(s)}{max}\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_*(s')]$ and $q_*(s,a)=\\sum_{s',r}p(s',r|s,a)[r+\\gamma \\underset{a'}{max}q_*(s', a')]$\n",
+"- Both are essentially recursive formulas; the **\"backup\"** idea they embody works backwards from the value of the successor state to the value of the current state.\n",
 "\n",
-"- 3. With the MDP concept in place, consider how to solve MDP problems.\n",
-"    - `planning` means evaluating $v_{\\pi}$,\n",
-"    - `controlling` means using that evaluation to find the optimal policy $\\pi_*$\n",
-"    - `policy iteration`: in each iteration, first evaluate the policy to obtain a new $v$; then use the new $v$ to improve the policy. $v_{\\pi}(s) = \\sum_a\\pi(a|s)\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_{\\pi}(s')]\\;\\;\\forall s \\in S$\n",
-"    - `value iteration`: in each iteration, directly search for that round's optimal $v$; only after all iterations extract the corresponding optimal policy. $v_{k+1}(s)=\\underset{a}{max} E[R_{t+1}+\\gamma v_k(S_{t+1})|S_t=s, A_t=a]\\\\=\\underset{a}{max}\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_k(s')]$\n",
+"### 3. With the MDP concept in place, consider how to solve MDP problems.\n",
+"Three basic methods for solving MDP problems:\n",
+"- Dynamic Programming: theoretically exact, but requires a lot of computation and an accurate model of the environment.\n",
+"- Monte Carlo Methods: needs no model and is easy to apply, but is not well suited to step-by-step incremental computation.\n",
+"- Temporal-Difference Learning: needs no model and is incremental, but is more complex to analyze.\n",
 "\n",
-"- 4. Next, consider the model-free setting, i.e. no state-transition probabilities.\n",
-"    - `planning`:\n",
-"        - `MC (Monte-Carlo)`: generate a complete episode each time, then for a state $s$ take the average return $G$ as the estimate of $v$. $V(s)\\leftarrow average(Return(s))$\n",
-"        - `TD (Temporal Difference)`: update $v$ after every decision: $V(S)\\leftarrow V(S)+\\alpha[R+\\gamma V(S')-V(S)]$\n",
-"    - `controlling`:\n",
-"        - `On-Policy Monte-Carlo Control`: `MC planning` + `greedy policy`\n",
-"        - `On-Policy Sarsa Control`: `TD planning` + `greedy policy`, but the update is on $q$ rather than $v$, hence the name `SARSA`: $Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma Q(S',A')-Q(S,A)]$\n",
-"        - Off-Policy: when generating episodes, let a reasonably good policy guide them instead of the final `target policy`; the guiding policy is called the `behavior policy`.\n",
+"The three basic methods can also be combined with one another:\n",
+"- Monte Carlo + temporal-difference learning, using multi-step bootstrapping.\n",
+"- Temporal-difference learning + model learning.\n",
+"\n",
+"- `planning` means evaluating $v_{\\pi}$.\n",
+"- `controlling` means using that evaluation to find the optimal policy $\\pi_*$.\n",
+"- `policy iteration`: in each iteration, first evaluate the policy to obtain a new $v$; then use the new $v$ to improve the policy.\n",
+"    - $v_{\\pi}(s) = \\sum_a\\pi(a|s)\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_{\\pi}(s')]\\;\\;\\forall s \\in S$\n",
+"- `value iteration`: in each iteration, directly search for that round's optimal $v$; only after all iterations extract the corresponding optimal policy.\n",
+"    - $v_{k+1}(s)=\\underset{a}{max} E[R_{t+1}+\\gamma v_k(S_{t+1})|S_t=s, A_t=a]\\\\=\\underset{a}{max}\\sum_{s',r}p(s',r|s,a)[r+\\gamma v_k(s')]$\n",
+"\n",
+"### 4. Next, consider the model-free setting, i.e. no state-transition probabilities.\n",
+"- `planning`:\n",
+"    - `MC (Monte-Carlo)`: generate a complete episode each time, then for a state $s$ take the average return $G$ as the estimate of $v$. $V(s)\\leftarrow average(Return(s))$\n",
+"    - `TD (Temporal Difference)`: update $v$ after every decision: $V(S)\\leftarrow V(S)+\\alpha[R+\\gamma V(S')-V(S)]$\n",
+"- `controlling`:\n",
+"    - `On-Policy Monte-Carlo Control`: `MC planning` + `greedy policy`\n",
+"    - `On-Policy Sarsa Control`: `TD planning` + `greedy policy`, but the update is on $q$ rather than $v$, hence the name `SARSA`: $Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma Q(S',A')-Q(S,A)]$\n",
+"    - Off-Policy: when generating episodes, let a reasonably good policy guide them instead of the final `target policy`; the guiding policy is called the `behavior policy`.\n",
 "        - `Off-Policy Monte-Carlo Control`: like `On-Policy Monte-Carlo Control`, but with importance sampling added.\n",
 "        - `Off-Policy Q-learning`: compared with `Sarsa`, the target bootstraps from the greedy action: $Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma \\underset{a'}{max}Q(S',a')-Q(S,A)]$\n",
 "\n",
-"- 5. Function approximation:\n",
-"    - Approximate the value function: objective $J(w) = E_{\\pi}[(v_{\\pi}(S)-\\hat v(S,w))^2]$, i.e. make the approximate value function close to the true value function.\n",
-"        - `Q-Learning with Linear Function Approximation`\n",
-"        - `Deep-Q Learning (DQN)`: uses `Experience Replay` and a `fixed Q-learning target`.\n",
-"    - Fit the policy function: objective $J_1(\\theta)=V^{\\pi_{\\theta}}(s_1) = E_{\\pi_{\\theta}}[v_1]$, i.e. find a policy that maximizes the value function.\n",
-"        - `Monte-Carlo Policy Gradient (REINFORCE)`\n",
-"    - Approximate the value function + fit the policy function\n",
-"        - `Actor-Critic`: the Critic updates the value-function parameters $w$; the Actor updates the policy parameters $\\theta$ in the direction suggested by the critic."
+"### 5. Function approximation:\n",
+"- Approximate the value function: objective $J(w) = E_{\\pi}[(v_{\\pi}(S)-\\hat v(S,w))^2]$, i.e. make the approximate value function close to the true value function.\n",
+"    - `Q-Learning with Linear Function Approximation`\n",
+"    - `Deep-Q Learning (DQN)`: uses `Experience Replay` and a `fixed Q-learning target`.\n",
+"- Fit the policy function: objective $J_1(\\theta)=V^{\\pi_{\\theta}}(s_1) = E_{\\pi_{\\theta}}[v_1]$, i.e. find a policy that maximizes the value function.\n",
+"    - `Monte-Carlo Policy Gradient (REINFORCE)`\n",
+"- Approximate the value function + fit the policy function\n",
+"    - `Actor-Critic`: the Critic updates the value-function parameters $w$; the Actor updates the policy parameters $\\theta$ in the direction suggested by the critic."
 ]
 },
 {
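
The `Bellman Expectation Equation` in section 2 above is exactly the backup behind iterative policy evaluation, the topic of the Gridworld notebook. A minimal tabular sketch of that backup follows; it assumes a hypothetical model `P[s][a]` given as a list of `(prob, next_state, reward, done)` tuples, and the names `policy_evaluation`, `num_states` and `num_actions` are illustrative rather than taken from `lib`.

```python
import numpy as np

def policy_evaluation(P, num_states, num_actions, policy, gamma=0.9, theta=1e-8):
    """Sweep the Bellman expectation backup until the value function stops changing.

    P[s][a]: assumed list of (prob, next_state, reward, done) tuples.
    policy:  (num_states, num_actions) array of action probabilities pi(a|s).
    """
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            v_new = 0.0
            for a in range(num_actions):
                for prob, s_next, reward, done in P[s][a]:
                    # Expected one-step return under pi, backed up from the successor state:
                    # v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]
                    v_new += policy[s, a] * prob * (reward + gamma * V[s_next] * (not done))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place sweep, the usual form of iterative policy evaluation
        if delta < theta:
            return V
```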
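
The `value iteration` update in section 3 can be sketched the same way: sweep the Bellman optimality backup, then extract the greedy policy only once the values have converged. Same assumed `P[s][a]` format as above; this is an illustrative sketch, not the notebooks' implementation.

```python
import numpy as np

def value_iteration(P, num_states, num_actions, gamma=0.9, theta=1e-8):
    """v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma * v_k(s')]."""
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            # One Bellman optimality backup per state: best action value under V.
            q = [sum(prob * (reward + gamma * V[s_next] * (not done))
                     for prob, s_next, reward, done in P[s][a])
                 for a in range(num_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # The optimal policy is extracted greedily only after all iterations, as described above.
    policy = np.zeros((num_states, num_actions))
    for s in range(num_states):
        q = [sum(prob * (reward + gamma * V[s_next] * (not done))
                 for prob, s_next, reward, done in P[s][a])
             for a in range(num_actions)]
        policy[s, int(np.argmax(q))] = 1.0
    return V, policy
```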
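
Section 4's point that `SARSA` and `Q-learning` differ only in how the TD target bootstraps (the next action actually chosen versus the greedy one) can be made concrete by writing the two updates side by side. In this sketch `Q`, `num_actions` and the helper names are illustrative assumptions, not the repository's API.

```python
import random
from collections import defaultdict

num_actions = 4                                  # illustrative
Q = defaultdict(lambda: [0.0] * num_actions)     # Q[s][a], lazily initialised to 0

def epsilon_greedy(state, eps=0.1):
    """Behavior policy: random action with probability eps, otherwise greedy."""
    if random.random() < eps:
        return random.randrange(num_actions)
    return max(range(num_actions), key=lambda a: Q[state][a])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    # On-policy: the target uses the action A' actually chosen by the same policy.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(s, a, r, s_next, alpha=0.5, gamma=1.0):
    # Off-policy: the target bootstraps from the greedy action,
    # regardless of which action the behavior policy executes next.
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```

Using `epsilon_greedy` as the behavior policy with `sarsa_update` gives the usual on-policy SARSA loop; the same behavior policy with `q_learning_update` is off-policy, because the target ignores the action that will actually be taken next.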
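
For section 5, combining the value-approximation objective $J(w)$ with the Q-learning target gives the semi-gradient update behind `Q-Learning with Linear Function Approximation` (the Mountain Car setting). A minimal sketch, assuming some feature map `phi(state)` that returns a NumPy vector; the class `LinearQ` and its method names are illustrative, not the notebook's code.

```python
import numpy as np

class LinearQ:
    """Q(s, a; w) = w[a] . phi(s): one weight vector per discrete action."""

    def __init__(self, num_features, num_actions):
        self.w = np.zeros((num_actions, num_features))

    def q_values(self, phi):
        return self.w @ phi  # vector of Q(s, a) for all actions

    def update(self, phi, a, reward, phi_next, done, alpha=0.01, gamma=0.99):
        # Semi-gradient step on J(w) = E[(target - Q(s,a;w))^2]:
        # the bootstrapped target is treated as a constant, so the gradient
        # of Q(s, a; w) with respect to w[a] is just phi.
        target = reward if done else reward + gamma * np.max(self.q_values(phi_next))
        td_error = target - self.w[a] @ phi
        self.w[a] += alpha * td_error * phi
```

`DQN` keeps the same objective but replaces the linear model with a neural network, adding `Experience Replay` and a `fixed Q-learning target` to stabilize training.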
@@ -64,21 +77,21 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 2",
+"display_name": "Python 3",
 "language": "python",
-"name": "python2"
+"name": "python3"
 },
 "language_info": {
 "codemirror_mode": {
 "name": "ipython",
-"version": 2
+"version": 3
 },
 "file_extension": ".py",
 "mimetype": "text/x-python",
 "name": "python",
 "nbconvert_exporter": "python",
-"pygments_lexer": "ipython2",
-"version": "2.7.14"
+"pygments_lexer": "ipython3",
+"version": "3.6.4"
 }
 },
 "nbformat": 4,
