
Commit 4f37c8c

slides and RL intro
1 parent 3e2bc85 commit 4f37c8c

File tree

2 files changed: +63 −33 lines


08a_Basics_of_Reinforcement_Learning.ipynb

Lines changed: 63 additions & 33 deletions
@@ -123,7 +123,7 @@
 "outputs": [],
 "source": [
  "# call this every iteration that we need to get\n",
- "# a batch of episodes. All environment interaction happend here \n",
+ "# a batch of episodes. All environment interactions happen here\n",
  "def iterate_batches(env, net, batch_size):\n",
  " # this function is called to generate training batches\n",
  " # as discussed in lecture, the algorithm will \n",
@@ -379,7 +379,7 @@
  " # change observation space to one hot encoded version \n",
  " # we do this so that our neural network can stay the same\n",
  " # this defines the vector of length N, with values of 0.0 up to 1.0\n",
- " # In the gym a box is like a tensor (ugh)\n",
+ " # In the gym a box is like a tensor...\n",
  " self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), \n",
  "     dtype=np.float32)\n",
  "\n",
@@ -465,9 +465,7 @@
 {
  "cell_type": "code",
  "execution_count": 20,
- "metadata": {
-  "scrolled": false
- },
+ "metadata": {},
  "outputs": [
  {
   "name": "stdout",
@@ -545,7 +543,7 @@
 "\n",
 "**Why was this not working?**\n",
 "\n",
-"Firstly, the input space is sparse so its harder to learn new observations from the randomized neural network, especially for rarely occurring observations (like when we get past the first few steps). Also, the reward is only given at the end and its unlikely for us to reach the end, so we need to do alot of exploring... And most of the time, there is not percentile that actually worked, so we never learn to emulate the output. \n",
+"Firstly, the input space is sparse, so it's harder to learn new observations from the randomized neural network, especially for rarely occurring observations (like when we get past the first few steps). This is our first insight into the sample efficiency of an algorithm: the cross-entropy method does not seem to be sample efficient when working with a sparse state space. Also, the reward is only given at the end and it's unlikely for us to reach the end, so we need to do a lot of exploring... And most of the time, there is no percentile cutoff that actually worked, so we never learn to emulate the output.\n",
 "\n",
 "It seems like even this simple problem is hard for cross entropy to solve. Perhaps we should go back to the basics of learning optimal policies? Yes! Let's see about value iteration.\n",
 "\n",
@@ -678,9 +676,7 @@
 {
  "cell_type": "code",
  "execution_count": 20,
- "metadata": {
-  "scrolled": false
- },
+ "metadata": {},
  "outputs": [
  {
   "name": "stdout",
@@ -729,9 +725,7 @@
 {
  "cell_type": "code",
  "execution_count": 21,
- "metadata": {
-  "scrolled": false
- },
+ "metadata": {},
  "outputs": [
  {
   "name": "stdout",
@@ -924,9 +918,7 @@
 {
  "cell_type": "code",
  "execution_count": 4,
- "metadata": {
-  "scrolled": false
- },
+ "metadata": {},
  "outputs": [
  {
   "name": "stdout",
@@ -1176,9 +1168,7 @@
 {
  "cell_type": "code",
  "execution_count": 10,
- "metadata": {
-  "scrolled": false
- },
+ "metadata": {},
  "outputs": [
  {
   "name": "stdout",
@@ -1508,7 +1498,10 @@
 "cell_type": "code",
 "execution_count": 1,
 "metadata": {
- "collapsed": true
+ "collapsed": true,
+ "jupyter": {
+  "outputs_hidden": true
+ }
 },
 "outputs": [],
 "source": [
@@ -1649,7 +1642,10 @@
 "cell_type": "code",
 "execution_count": 2,
 "metadata": {
- "collapsed": true
+ "collapsed": true,
+ "jupyter": {
+  "outputs_hidden": true
+ }
 },
 "outputs": [],
 "source": [
@@ -1700,9 +1696,7 @@
 {
  "cell_type": "code",
  "execution_count": 4,
- "metadata": {
-  "scrolled": false
- },
+ "metadata": {},
  "outputs": [
  {
   "name": "stdout",
@@ -1861,7 +1855,44 @@
 "Best mean reward updated 0.540 -> 0.550, model saved\n",
 "Best mean reward updated 0.550 -> 0.560, model saved\n",
 "Best mean reward updated 0.560 -> 0.570, model saved\n",
-"Best mean reward updated 0.570 -> 0.580, model saved\n"
+"Best mean reward updated 0.570 -> 0.580, model saved\n",
+"Best mean reward updated 0.580 -> 0.590, model saved\n",
+"Best mean reward updated 0.590 -> 0.600, model saved\n",
+"Best mean reward updated 0.600 -> 0.610, model saved\n",
+"Best mean reward updated 0.610 -> 0.620, model saved\n",
+"103300: done 8920 iterations, mean reward 0.620, eps 0.00\n",
+"Best mean reward updated 0.620 -> 0.630, model saved\n",
+"Best mean reward updated 0.630 -> 0.640, model saved\n",
+"Best mean reward updated 0.640 -> 0.650, model saved\n",
+"Best mean reward updated 0.650 -> 0.660, model saved\n",
+"Best mean reward updated 0.660 -> 0.670, model saved\n",
+"Best mean reward updated 0.670 -> 0.680, model saved\n",
+"Best mean reward updated 0.680 -> 0.690, model saved\n",
+"Best mean reward updated 0.690 -> 0.700, model saved\n",
+"Best mean reward updated 0.700 -> 0.710, model saved\n",
+"Best mean reward updated 0.710 -> 0.720, model saved\n",
+"Best mean reward updated 0.720 -> 0.730, model saved\n",
+"Best mean reward updated 0.730 -> 0.740, model saved\n",
+"108400: done 9053 iterations, mean reward 0.690, eps 0.00\n",
+"112700: done 9163 iterations, mean reward 0.580, eps 0.00\n",
+"113200: done 9172 iterations, mean reward 0.590, eps 0.00\n",
+"113900: done 9191 iterations, mean reward 0.570, eps 0.00\n",
+"121200: done 9360 iterations, mean reward 0.690, eps 0.00\n",
+"123400: done 9408 iterations, mean reward 0.710, eps 0.00\n",
+"123600: done 9415 iterations, mean reward 0.710, eps 0.00\n",
+"123800: done 9420 iterations, mean reward 0.730, eps 0.00\n",
+"125300: done 9454 iterations, mean reward 0.650, eps 0.00\n",
+"126500: done 9479 iterations, mean reward 0.630, eps 0.00\n",
+"130100: done 9561 iterations, mean reward 0.730, eps 0.00\n",
+"Best mean reward updated 0.740 -> 0.750, model saved\n",
+"Best mean reward updated 0.750 -> 0.760, model saved\n",
+"Best mean reward updated 0.760 -> 0.770, model saved\n",
+"131000: done 9585 iterations, mean reward 0.770, eps 0.00\n",
+"Best mean reward updated 0.770 -> 0.780, model saved\n",
+"Best mean reward updated 0.780 -> 0.790, model saved\n",
+"Best mean reward updated 0.790 -> 0.800, model saved\n",
+"Best mean reward updated 0.800 -> 0.810, model saved\n",
+"Solved in 132361 frames!\n"
 ]
 },
 {
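The log above comes from tracking a moving average of episode rewards and checkpointing the network on every improvement. A hedged sketch of that bookkeeping, with assumed names (track_progress, SOLVED_BOUND) and an assumed solve threshold near the 0.81 where this run stops:

import numpy as np
import torch

SOLVED_BOUND = 0.80  # assumed threshold; the log declares "Solved" around 0.81

total_rewards = []
best_mean_reward = None

def track_progress(episode_reward, frame_idx, net, path="best_model.dat"):
    # Moving average over the last 100 episodes; checkpoint on improvement.
    global best_mean_reward
    total_rewards.append(episode_reward)
    mean_reward = float(np.mean(total_rewards[-100:]))
    if best_mean_reward is None or mean_reward > best_mean_reward:
        torch.save(net.state_dict(), path)
        if best_mean_reward is not None:
            print("Best mean reward updated %.3f -> %.3f, model saved"
                  % (best_mean_reward, mean_reward))
        best_mean_reward = mean_reward
    if mean_reward > SOLVED_BOUND:
        print("Solved in %d frames!" % frame_idx)
        return True  # tell the training loop to stop
    return False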
@@ -2023,7 +2054,10 @@
 "cell_type": "code",
 "execution_count": 5,
 "metadata": {
- "collapsed": true
+ "collapsed": true,
+ "jupyter": {
+  "outputs_hidden": true
+ }
 },
 "outputs": [],
 "source": [
@@ -2138,9 +2172,7 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "metadata": {
-  "scrolled": false
- },
+ "metadata": {},
  "outputs": [],
  "source": [
   "# load up some utilities \n",
@@ -2208,9 +2240,7 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "metadata": {
-  "scrolled": false
- },
+ "metadata": {},
  "outputs": [],
  "source": [
   "# training (no resets of the Agent or training values)\n",
@@ -2495,9 +2525,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.16"
+"version": "3.11.9"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }

PDF_slides/DL_6a_RL_intro.pdf

1.83 MB
Binary file not shown.
