|
25 | 25 | "- [](#)\n",
|
26 | 26 | "- [](#)\n",
|
27 | 27 | "- [](#)\n",
|
28 |
| - "- [](#)\n", |
| 28 | + "- [13. Policy Gradient Methods](#13.-Policy-Gradient-Methods)\n", |
29 | 29 | "- [](#)\n",
|
30 | 30 | "- [](#)\n",
|
31 | 31 | "- [](#)\n",
|
|
314 | 314 | "### on-policy vs off-policy\n",
|
315 | 315 | "- on-policy只有一套policy,更简单,是首选。\n",
|
316 | 316 | "- off-policy使用两套policy,更复杂、更难收敛;但也更通用、更强大。\n",
|
| 317 | + "- on-policy和off-policy本质依然是Exploit vs Explore的权衡。\n", |
317 | 318 | "\n",
|
318 | 319 | "### on-policy\n",
|
319 | 320 | "- 去评估和提高生成episode时采用的policy。**全过程只有一种策略**,MC ES属于on-policy。\n",
|
320 |
| - "- \n", |
321 | 321 | "\n",
|
322 | 322 | "### off-policy\n",
|
323 | 323 | "- 所有的MC控制方法都面临一个**困境**:它们都想找到一个最优的策略,但却必须采用非最优的策略去尽可能多地探索(explore)数据。\n",
|
|
345 | 345 | "- 上面的评估使用了采样权重增量式的方法。\n",
|
346 | 346 | "- 控制:\n",
|
347 | 347 | "- \n",
|
348 |
| - "\n", |
349 | 348 | "\n"
|
350 | 349 | ]
|
351 | 350 | },
|
|
356 | 355 | "## 6. Temporal-Difference Learning\n",
|
357 | 356 | "\n",
|
358 | 357 | "\n",
|
359 |
| - "****:\n", |
360 |
| - "\n", |
361 |
| - "\n", |
362 |
| - "$V(S)\\leftarrow V(S)+\\alpha[R+\\gamma V(S')-V(S)]$\n", |
363 |
| - "\n", |
364 |
| - "\n", |
365 |
| - "$Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma Q(S',A')-Q(S,A)]$\n", |
366 |
| - "\n", |
367 |
| - "\n", |
368 |
| - "\n", |
369 |
| - "\n", |
370 |
| - "\n", |
371 |
| - "****:价值函数更新:$Q(S,A)\\leftarrow Q(S,A)+\\alpha[R+\\gamma Q(S',A')-Q(S, A)]$\n", |
372 |
| - "\n", |
373 |
| - "****:\n", |
374 |
| - "\n", |
375 |
| - "****:\n", |
376 |
| - "\n", |
377 |
| - "****:\n", |
378 |
| - "\n", |
379 |
| - "****:\n", |
380 |
| - "\n", |
381 |
| - "****:\n", |
382 |
| - "\n", |
383 |
| - "****:\n", |
384 |
| - "\n", |
385 |
| - "****:" |
| 358 | + "### 时序差分(Temporal-Difference)简介\n", |
| 359 | + "- 时序差分是强化学习的核心观点。\n", |
| 360 | + "- 时序差分是DP和MC方法的结合。\n", |
| 361 | + "- 时序差分不需要像MC一样,要等一个完整的序列结束;相反,每经历一步,都会更新价值函数。\n", |
| 362 | + "- TD往往比MC高效\n", |
| 363 | + "\n", |
| 364 | + "### 什么是stationary?\n", |
| 365 | + "- stationary:环境不随时间变化而变化;\n", |
| 366 | + "- non-stationary:环境会随时间变化而变化。\n", |
| 367 | + "\n", |
| 368 | + "### TD(0)\n", |
| 369 | + "- $V(S_t)\\leftarrow V(S_t)+\\alpha[R_{t+1}+\\gamma V(S_{t+1})-V(S_t)]$\n", |
| 370 | + "- 因为直接使用现有的估计取更新估计,因此这种方法被称为**自举(bootstrap)**。\n", |
| 371 | + "- \n", |
| 372 | + "- **TD error**:$\\delta_t = R_{t+1}+\\gamma V(S_{t+1})-V(S_t)$\n", |
| 373 | + "- \n", |
| 374 | + "\n", |
| 375 | + "### Sarsa\n", |
| 376 | + "- 一种on-policy的TD控制。\n", |
| 377 | + "- $Q(S_t,A_t)\\leftarrow Q(S_t,A_t)+\\alpha[R_{t+1}+\\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)]$\n", |
| 378 | + "- \n", |
| 379 | + "- \n", |
| 380 | + "\n", |
| 381 | + "### Q-learning\n", |
| 382 | + "- 一种off-policy的TD控制。\n", |
| 383 | + "- 早期强化学习的一个突破。\n", |
| 384 | + "- $Q(S_t,A_t)\\leftarrow Q(S_t,A_t)+\\alpha[R_{t+1}+\\gamma \\underset{a}{max}Q(S_{t+1},a)-Q(S_t,A_t)]$\n", |
| 385 | + "- \n", |
| 386 | + "- \n", |
| 387 | + "\n", |
| 388 | + "### Expected Sarsa\n", |
| 389 | + "- 一种off-policy的TD控制。\n", |
| 390 | + "- $Q(S_t,A_t)\\leftarrow Q(S_t,A_t) + \\alpha[R_{t+1} + \\gamma\\sum_a\\pi(a|S_{t+1})Q(S_{t+1}, a)-Q(S_t,A_t)]$\n", |
| 391 | + "\n", |
| 392 | + "### Double Learning\n", |
| 393 | + "- 解决Q-learning的**最大化偏差(maximization bias)**问题\n", |
| 394 | + "- 2011年提出。\n", |
| 395 | + "- " |
386 | 396 | ]
|
387 | 397 | },
|
388 | 398 | {
|
389 | 399 | "cell_type": "markdown",
|
390 | 400 | "metadata": {},
|
391 |
| - "source": [] |
| 401 | + "source": [ |
| 402 | + "## 7. n-step Bootstrapping\n", |
| 403 | + "\n" |
| 404 | + ] |
392 | 405 | },
|
393 | 406 | {
|
394 | 407 | "cell_type": "markdown",
|
395 | 408 | "metadata": {},
|
396 |
| - "source": [] |
| 409 | + "source": [ |
| 410 | + "## 8. Planning and Learning with Tabular Methods\n" |
| 411 | + ] |
397 | 412 | },
|
398 | 413 | {
|
399 | 414 | "cell_type": "markdown",
|
400 | 415 | "metadata": {},
|
401 |
| - "source": [] |
| 416 | + "source": [ |
| 417 | + "## 9. On-policy Prediction with Approximation" |
| 418 | + ] |
402 | 419 | },
|
403 | 420 | {
|
404 | 421 | "cell_type": "markdown",
|
|
418 | 435 | {
|
419 | 436 | "cell_type": "markdown",
|
420 | 437 | "metadata": {},
|
421 |
| - "source": [] |
| 438 | + "source": [ |
| 439 | + "## 13. Policy Gradient Methods" |
| 440 | + ] |
422 | 441 | },
|
423 | 442 | {
|
424 | 443 | "cell_type": "markdown",
|
|
461 | 480 | "metadata": {
|
462 | 481 | "anaconda-cloud": {},
|
463 | 482 | "kernelspec": {
|
464 |
| - "display_name": "Python 3", |
| 483 | + "display_name": "Python 2", |
465 | 484 | "language": "python",
|
466 |
| - "name": "python3" |
| 485 | + "name": "python2" |
467 | 486 | },
|
468 | 487 | "language_info": {
|
469 | 488 | "codemirror_mode": {
|
470 | 489 | "name": "ipython",
|
471 |
| - "version": 3 |
| 490 | + "version": 2 |
472 | 491 | },
|
473 | 492 | "file_extension": ".py",
|
474 | 493 | "mimetype": "text/x-python",
|
475 | 494 | "name": "python",
|
476 | 495 | "nbconvert_exporter": "python",
|
477 |
| - "pygments_lexer": "ipython3", |
478 |
| - "version": "3.6.4" |
| 496 | + "pygments_lexer": "ipython2", |
| 497 | + "version": "2.7.14" |
479 | 498 | }
|
480 | 499 | },
|
481 | 500 | "nbformat": 4,
|
|