applenob
diff --git a/README.md b/README.md
Lines changed: 19 additions & 1 deletion
diff --git a/book/bookdraft2018.pdf b/book/bookdraft2018.pdf
23.8 KB
diff --git a/notes/intro_note_01.md b/notes/intro_note_01.md
Lines changed: 133 additions & 6 deletions
diff --git a/notes/intro_note_03.md b/notes/intro_note_03.md
Lines changed: 2 additions & 2 deletions
diff --git a/res/ttt_demo.jpg b/res/ttt_demo.jpg
41 KB
diff --git a/res/ttt_value.jpg b/res/ttt_value.jpg
114 KB
@@ -12,7 +12,25 @@
 
 - [Study notes for David Silver's Reinforcement Learning course.](class_note.ipynb)
 - [All slides for the course](slides)
-- [Study notes for Sutton's book Reinforcement Learning: An Introduction](reinforcement_learning.ipynb)
+- Study notes for Sutton's book Reinforcement Learning: An Introduction
+  - [1. Introduction](notes/intro_note_01.md)
+  - [2. Multi-armed Bandits](notes/intro_note_02.md)
+  - [3. Finite Markov Decision Processes](notes/intro_note_03.md)
+  - [4. Dynamic Programming](notes/intro_note_04.md)
+  - [5. Monte Carlo Methods](notes/intro_note_05.md)
+  - [6. Temporal-Difference Learning](notes/intro_note_06.md)
+  - [7. n-step Bootstrapping](notes/intro_note_07.md)
+  - [8. Planning and Learning with Tabular Methods](notes/intro_note_08.md)
+  - [9. On-policy Prediction with Approximation](notes/intro_note_09.md)
+  - [10. On-policy Control with Approximation](notes/intro_note_10.md)
+  - [11. Off-policy Methods with Approximation](notes/intro_note_11.md)
+  - [12. Eligibility Traces](notes/intro_note_12.md)
+  - [13. Policy Gradient Methods](notes/intro_note_13.md)
+  - [14. Psychology](notes/intro_note_14.md)
+  - [15. Neuroscience](notes/intro_note_15.md)
+  - [16. Applications and Case Studies](notes/intro_note_16.md)
+  - [17. Frontiers](notes/intro_note_17.md)
+
 - [PDFs of the various editions of the book](book)
   - [2017-6 draft](book/bookdraft2017june19.pdf)
   - [2018 second edition](book/bookdraft2018.pdf)
 
@@ -53,11 +53,138 @@
 
 ## Tic-Tac-Toe
 
-- ![tic-tac-toe](../res/ttt.png)
+![tic-tac-toe](../res/ttt.png)
+
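 - A simple example of applying reinforcement learning.
 - Define the policy: how to move in any given board position.
-- Genetic algorithm approach: try many policies, find the few that end up winning, then combine and update them.
-- Reinforcement learning approach:
-  - 1. Build a table of size state_num × 1 that holds the probability of winning from each state. This table is the **value function**, i.e. the mapping from states to values.
-  - 2. Play many games against the opponent. At each move, look at all possible successor states of the current state and pick the one with the highest probability of winning (the highest value). This is the greedy choice (exploit). Occasionally we pick one of the other successor states at random instead (explore).
-  - 3. "Back up" the value of the successor state to the current state: $V(s)\leftarrow V(s)+\alpha[V(s')-V(s)]$. This is **temporal-difference learning**, so named because $V(s')-V(s)$ is the difference between two estimates made at two different times.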
+
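+**Genetic algorithm approach**: try many policies, find the few that end up winning, then combine and update them.
+
+**Reinforcement learning approach**:
+
+- 1. Build a table of size state_num × 1 that holds the probability of winning from each state. This table is the **value function**, i.e. the mapping from states to values.
+- 2. Play many games against the opponent. At each move, look at all possible successor states of the current state and pick the one with the highest probability of winning (the highest value). This is the greedy choice (exploit). Occasionally we pick one of the other successor states at random instead (explore).
+- 3. **Back up** the value of the successor state to the current state: $V(s)\leftarrow V(s)+\alpha[V(s')-V(s)]$. This is **temporal-difference learning**, so named because $V(s')-V(s)$ is the difference between two estimates made at two different times.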
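
As a rough illustration of step 3, the sketch below applies the update rule along a made-up three-state episode, working from the last state back to the first. The state names, the initial values of 0.5, and the step size of 0.1 are assumptions chosen for the example, not values taken from the notes.

```python
def td_backup(values, episode, step_size):
    """Back up values along one episode, from the last state to the first.

    Applies V(s) <- V(s) + step_size * (V(s') - V(s)) to each
    consecutive pair (s, s') in the episode.
    """
    for i in reversed(range(len(episode) - 1)):
        state, next_state = episode[i], episode[i + 1]
        values[state] += step_size * (values[next_state] - values[state])
    return values


# Hypothetical episode: two intermediate states followed by a winning state.
values = {"s0": 0.5, "s1": 0.5, "win": 1.0}
td_backup(values, ["s0", "s1", "win"], step_size=0.1)
print(values)
```

With these numbers, `s1` moves from 0.5 to about 0.55 and `s0` to about 0.505, so the win propagates backwards a little further with every game played.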
+
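+### Code analysis
+
+[Full source code](https://github.com/ShangtongZhang/reinforcement-learning-an-introduction/blob/master/chapter01/tic_tac_toe.py)
+
+Game implementation:
+
+`1` represents a white piece and `-1` a black piece. If the three numbers in any line (row, column, or diagonal) sum to 3, white wins; if they sum to -3, black wins. If nobody has won and the absolute values of all cells sum to 9, the game is a tie.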
+
+```python
+for result in results:
+    if result == 3:
+        self.winner = 1
+        self.end = True
+        return self.end
+    if result == -3:
+        self.winner = -1
+        self.end = True
+        return self.end
+
+# whether it's a tie
+sum = np.sum(np.abs(self.data))
+if sum == BOARD_ROWS * BOARD_COLS:
+    self.winner = 0
+    self.end = True
+    return self.end
+```
+
+Defining the state dictionary:
+
+```python
+all_states = dict()
+all_states[current_state.hash()] = (current_state, current_state.is_end())
+```
+
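+Here the key is the hash of the state, and the value is the state object together with a flag saying whether the state is terminal. The hash is computed as follows: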
+
+```python
+# compute the hash value for one state, it's unique
+def hash(self):
+    if self.hash_val is None:
+        self.hash_val = 0
+        for i in self.data.reshape(BOARD_ROWS * BOARD_COLS):
+            if i == -1:
+                i = 2
+            self.hash_val = self.hash_val * 3 + i
+    return int(self.hash_val)
+```
+
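+As the hash shows, each board is encoded as a 9-digit base-3 number, so there are at most $3^9=19683$ distinct states. This is an upper bound: many of these encodings can never occur in legal play. The value table below has one key per entry in `all_states`, so it is bounded by the same number.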
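
To get a feel for how many of the $3^9$ encodings actually occur, here is a minimal sketch that enumerates the boards reachable in legal play and hashes them the same way. It is not the enumeration used in the linked source: the flat 9-tuple board, the helper names, and the assumption that player `1` always moves first are all choices made for this example.

```python
# Index triples of the 8 winning lines on a flat 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def is_end(board):
    """True if either player has three in a line or the board is full."""
    won = any(abs(board[a] + board[b] + board[c]) == 3 for a, b, c in LINES)
    return won or 0 not in board


def base3_hash(board):
    """Read the board as a base-3 number, mapping -1 to the digit 2."""
    value = 0
    for cell in board:
        value = value * 3 + (2 if cell == -1 else cell)
    return value


def reachable_boards():
    """Collect every board reachable in legal play when player 1 moves first."""
    start = (0,) * 9
    seen = {start}
    stack = [(start, 1)]
    while stack:
        board, symbol = stack.pop()
        if is_end(board):
            continue  # no moves are played after the game ends
        for i, cell in enumerate(board):
            if cell == 0:
                nxt = board[:i] + (symbol,) + board[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, -symbol))
    return seen


boards = reachable_boards()
hashes = {base3_hash(b) for b in boards}
print(len(boards), len(hashes), max(hashes) < 3 ** 9)
```

Under these assumptions the enumeration finds 5478 boards, each with a distinct hash below $3^9$, so a table built only from reachable states is considerably smaller than the upper bound.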
+
+The value table is also implemented as a dict:
+
+```python
+self.estimations = dict()
+...
+for hash_val in all_states.keys():
+    (state, is_end) = all_states[hash_val]
+    if is_end:
+        if state.winner == self.symbol:
+            self.estimations[hash_val] = 1.0
+        elif state.winner == 0:
+            # we need to distinguish between a tie and a lose
+            self.estimations[hash_val] = 0.5
+        else:
+            self.estimations[hash_val] = 0
+    else:
+        self.estimations[hash_val] = 0.5
+```
+
+Backup:
+
+```python
+# update value estimation
+def backup(self):
+    self.states = [state.hash() for state in self.states]
+
+    for i in reversed(range(len(self.states) - 1)):
+        state = self.states[i]
+        td_error = self.greedy[i] * (self.estimations[self.states[i + 1]] - self.estimations[state])
+        self.estimations[state] += self.step_size * td_error
+```
+
+Action selection uses epsilon-greedy:
+
+```python
+# choose an action based on the state
+def act(self):
+    state = self.states[-1]
+    next_states = []
+    next_positions = []
+    for i in range(BOARD_ROWS):
+        for j in range(BOARD_COLS):
+            if state.data[i, j] == 0:
+                next_positions.append([i, j])
+                next_states.append(state.next_state(i, j, self.symbol).hash())
+
+    if np.random.rand() < self.epsilon:
+        action = next_positions[np.random.randint(len(next_positions))]
+        action.append(self.symbol)
+        self.greedy[-1] = False
+        return action
+
+    values = []
+    for hash, pos in zip(next_states, next_positions):
+        values.append((self.estimations[hash], pos))
+    # to select one of the actions of equal value at random
+    np.random.shuffle(values)
+    values.sort(key=lambda x: x[0], reverse=True)
+    action = values[0][1]
+    action.append(self.symbol)
+    return action
+```
+
+You can play against the trained AI player in the terminal:
+
+![ttt_demo](../res/ttt_demo.jpg)
+
+I played quite a few games and they all ended in ties, so the training seems to have worked fairly well.
+
+![ttt_value](../res/ttt_value.jpg)
+
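+Once the model is trained, the data that gets saved is just the value table. This also exposes a problem: even for a game as simple as tic-tac-toe, storing the value of every state in a table takes a considerable amount of storage.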
@@ -56,9 +56,9 @@
 
 ## Value functions
 
-- The state-value function for policy $\pi$: $v_{\pi}(s) = {\mathbb{E}}_{\pi}[G_t|S_t=s]$ $=\mathbb{E}_{\pi}[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}|S_t=s]$
+- The state-value function for policy $\pi$: $v_{\pi}(s) = {\mathbb{E}}_{\pi}[G_t|S_t=s] = \mathbb{E}_{\pi}[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}|S_t=s]$
 - That is, **given that policy $\pi$ is followed, it measures how good it is to be in a given state**.
-- The action-value function for policy $\pi$: $q_{\pi}(a,s) = \mathbb{E}_{\pi}[G_t|S_t=s,A_t=a]$ $= \mathbb{E}_{\pi}[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}|S_t=s,A_t=a]$
+- The action-value function for policy $\pi$: $q_{\pi}(a,s) = \mathbb{E}_{\pi}[G_t|S_t=s,A_t=a] = \mathbb{E}_{\pi}[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}|S_t=s,A_t=a]$
 - That is, given that policy $\pi$ is followed, it measures how good it is to take a given action in a given state.
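
Both definitions are expectations of the discounted return $G_t=\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}$. The sketch below computes that sum for one finite reward sequence. The function name, the reward list, and the choice $\gamma=0.9$ are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma**k * R_{t+k+1} for a finite reward list.

    rewards[0] plays the role of R_{t+1}.
    """
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3} from one sampled trajectory.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```

Here the result is about 2.62, i.e. $1 + 0.9\cdot 0 + 0.9^2\cdot 2$. Averaging such returns over many trajectories that start in $s$ and follow $\pi$ gives an estimate of $v_{\pi}(s)$.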
 
 ## Bellman Equation