Commit ef04fe4
Author: cer
Message: update
Parent: 951bfb6

File tree

6 files changed: +58 -39 lines changed

book/bookdraft2018.pdf

237 KB
Binary file not shown.

reinforcement_learning.ipynb

Lines changed: 58 additions & 39 deletions
@@ -25,7 +25,7 @@
 "- [](#)\n",
 "- [](#)\n",
 "- [](#)\n",
-"- [](#)\n",
+"- [13. Policy Gradient Methods](#13.-Policy-Gradient-Methods)\n",
 "- [](#)\n",
 "- [](#)\n",
 "- [](#)\n",
@@ -314,10 +314,10 @@
 "### on-policy vs off-policy\n",
 "- on-policy uses only one policy; it is simpler and is the first choice.\n",
 "- off-policy uses two policies; it is more complex and harder to get to converge, but also more general and more powerful.\n",
+"- At bottom, on-policy vs off-policy is still the exploit vs explore trade-off.\n",
 "\n",
 "### on-policy\n",
 "- Evaluate and improve the very policy that is used to generate the episodes. **A single policy is used throughout**; MC ES is on-policy.\n",
-"- ![](https://github.com/applenob/rl_learn/raw/master/res/on_policy_fv_mc_control.png)\n",
 "\n",
 "### off-policy\n",
 "- All MC control methods face a **dilemma**: they want to find an optimal policy, yet they have to follow a non-optimal policy in order to explore as much data as possible.\n",
@@ -345,7 +345,6 @@
 "- The evaluation above uses the incremental form of the importance-sampling-weight update.\n",
 "- Control:\n",
 "- ![](https://github.com/applenob/rl_learn/raw/master/res/off_policy_mc_control.png)\n",
-"\n",
 "\n"
 ]
 },
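The incremental weighted importance-sampling update mentioned in the hunk above (the rule behind off_policy_mc_control.png) can be sketched in a few lines of tabular Python. This is a minimal sketch following Sutton & Barto's off-policy MC control, not code from the repository: the epsilon-greedy behaviour policy, the env.reset()/env.step() interface, and hashable states are all assumptions made here.

import numpy as np
from collections import defaultdict

def off_policy_mc_control(env, num_episodes, gamma=0.99, epsilon=0.1):
    # Off-policy MC control with weighted importance sampling, incremental form:
    #   C(s,a) += W;  Q(s,a) += (W / C(s,a)) * (G - Q(s,a));  W *= pi(a|s) / b(a|s)
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))   # action-value estimates
    C = defaultdict(lambda: np.zeros(n_actions))   # cumulative importance weights

    for _ in range(num_episodes):
        # Behaviour policy b: epsilon-greedy with respect to the current Q (assumption).
        episode, state, done = [], env.reset(), False
        while not done:
            probs = np.full(n_actions, epsilon / n_actions)
            probs[int(np.argmax(Q[state]))] += 1.0 - epsilon
            action = int(np.random.choice(n_actions, p=probs))
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward, probs[action]))
            state = next_state

        # Target policy pi: greedy with respect to Q. Walk the episode backwards.
        G, W = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            if action != int(np.argmax(Q[state])):
                break            # pi(a|s) = 0, so all earlier steps would get zero weight
            W *= 1.0 / b_prob    # pi(a|s) = 1 for the greedy action
    return Q

The early break is the point of the weighted scheme: once the taken action is no longer greedy under the target policy, every earlier step of the episode carries zero importance weight, so there is nothing left to update.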
@@ -356,49 +355,67 @@
 "## 6. Temporal-Difference Learning\n",
 "\n",
 "\n",
-"****:\n",
-"![](https://github.com/applenob/rl_learn/raw/master/res/td0_est.png)\n",
-"\n",
-"$V(S)\\leftarrow V(S)+\\alpha[R+\\gamma V(S')-V(S)]$\n",
-"![](https://github.com/applenob/rl_learn/raw/master/res/sarsa_2.png)\n",
-"![](https://github.com/applenob/rl_learn/raw/master/res/sarsa_est.png)\n",
-"$Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma Q(S',A')-Q(S,A)]$\n",
-"![](https://github.com/applenob/rl_learn/raw/master/res/q_learn_backup.png)\n",
-"\n",
-"![](https://github.com/applenob/rl_learn/raw/master/res/q_learn.png)\n",
-"\n",
-"\n",
-"****: value-function update: $Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma Q(S',A')-Q(S, A)]$\n",
-"\n",
-"****:\n",
-"\n",
-"****:\n",
-"\n",
-"****:\n",
-"\n",
-"****:\n",
-"\n",
-"****:\n",
-"\n",
-"****:\n",
-"\n",
-"****:"
+"### Temporal-Difference (TD) learning in brief\n",
+"- Temporal-difference learning is the central idea of reinforcement learning.\n",
+"- TD combines ideas from DP and MC methods.\n",
+"- Unlike MC, TD does not have to wait for a complete episode to finish; it updates the value function after every step.\n",
+"- TD is often more efficient than MC.\n",
+"\n",
+"### What does stationary mean?\n",
+"- stationary: the environment does not change over time;\n",
+"- non-stationary: the environment changes over time.\n",
+"\n",
+"### TD(0)\n",
+"- $V(S_t)\\leftarrow V(S_t)+\\alpha[R_{t+1}+\\gamma V(S_{t+1})-V(S_t)]$\n",
+"- Because it updates an estimate directly from another existing estimate, this approach is called **bootstrapping**.\n",
+"- ![](https://github.com/applenob/rl_learn/raw/master/res/td0_est.png)\n",
+"- **TD error**: $\\delta_t = R_{t+1}+\\gamma V(S_{t+1})-V(S_t)$\n",
+"- ![](https://github.com/applenob/rl_learn/raw/master/res/td0.png)\n",
+"\n",
+"### Sarsa\n",
+"- An on-policy TD control method.\n",
+"- $Q(S_t,A_t)\\leftarrow Q(S_t,A_t)+\\alpha[R_{t+1}+\\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)]$\n",
+"- ![](https://github.com/applenob/rl_learn/raw/master/res/sarsa_est.png)\n",
+"- ![](https://github.com/applenob/rl_learn/raw/master/res/sarsa_backup.png)\n",
+"\n",
+"### Q-learning\n",
+"- An off-policy TD control method.\n",
+"- One of the early breakthroughs of reinforcement learning.\n",
+"- $Q(S_t,A_t)\\leftarrow Q(S_t,A_t)+\\alpha[R_{t+1}+\\gamma \\max_a Q(S_{t+1},a)-Q(S_t,A_t)]$\n",
+"- ![](https://github.com/applenob/rl_learn/raw/master/res/q_learn.png)\n",
+"- ![](https://github.com/applenob/rl_learn/raw/master/res/q_learn_backup.png)\n",
+"\n",
+"### Expected Sarsa\n",
+"- A TD control method that takes the expectation over the next actions instead of sampling one; it can be used on-policy or off-policy.\n",
+"- $Q(S_t,A_t)\\leftarrow Q(S_t,A_t) + \\alpha[R_{t+1} + \\gamma\\sum_a\\pi(a|S_{t+1})Q(S_{t+1}, a)-Q(S_t,A_t)]$\n",
+"\n",
+"### Double Learning\n",
+"- Addresses the **maximization bias** problem of Q-learning.\n",
+"- Proposed in 2011.\n",
+"- ![](https://github.com/applenob/rl_learn/raw/master/res/double_q_learn.png)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": []
+"source": [
+"## 7. n-step Bootstrapping\n",
+"\n"
+]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": []
+"source": [
+"## 8. Planning and Learning with Tabular Methods\n"
+]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": []
+"source": [
+"## 9. On-policy Prediction with Approximation"
+]
 },
 {
 "cell_type": "markdown",
@@ -418,7 +435,9 @@
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": []
+"source": [
+"## 13. Policy Gradient Methods"
+]
 },
 {
 "cell_type": "markdown",
@@ -461,21 +480,21 @@
 "metadata": {
 "anaconda-cloud": {},
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 2",
 "language": "python",
-"name": "python3"
+"name": "python2"
 },
 "language_info": {
 "codemirror_mode": {
 "name": "ipython",
-"version": 3
+"version": 2
 },
 "file_extension": ".py",
 "mimetype": "text/x-python",
 "name": "python",
 "nbconvert_exporter": "python",
-"pygments_lexer": "ipython3",
-"version": "3.6.4"
+"pygments_lexer": "ipython2",
+"version": "2.7.14"
 }
 },
 "nbformat": 4,

res/double_q_learn.png
37.7 KB (mode changed 100755 → 100644)

res/q_learn.png
29.4 KB (mode changed 100755 → 100644)

res/sarsa_est.png
36.1 KB (mode changed 100755 → 100644)

res/td_0.png
4.52 KB
