Commit 8e8f137 (2 parents: af82506 + 0fe650c)

1 file changed

report/report.tex: 41 additions & 14 deletions
@@ -54,15 +54,15 @@

\section{Introduction}

-\subsection{Learning platform}
+\subsection{Learning Platform}

I based the platform on the Atmel ATmega328p, a cheap and widely available 8-bit microcontroller. It has 32 KB of program memory and 2 KB of SRAM, and operates at 16 MHz. This chip is used by popular hobbyist electronics boards such as the Arduino Uno. For this project, I used an Arduino Pro Mini development board. It does not have a floating point unit, though the Arduino runtime provides a software float implementation.

In order to construct a task representative of the complexity an embedded agent might face, I decided to build a two-DOF arm with an LED actuator. The arm's joints are SG90 micro-servos, each capable of 180$\degree$ of rotation. The base joint is fixed to a surface, and the elbow joint is mounted to the base with 3-inch connecting dowels. The elbow joint controls another rod, which is tipped with an LED. The motors are connected such that the middles of their rotation ranges align.

\subsection{Problem}

-The agent must point the LED at a photo cell fixed to the surface in as few movements as possible, activating the LED as little as possible. The episode ends when the photo cell reads above a defined threshold.
+The agent must point the LED at a photocell fixed to the surface in as few movements as possible, activating the LED as little as possible. It begins from a random initial configuration. The episode ends when the photocell reads above a defined threshold.

\[ r(s,a,s') = \left\{
\begin{array}{ll}
@@ -73,26 +73,26 @@
\right. \]


-\subsection{Learning approach}
+\subsection{Learning Approach}

\subsubsection{Tabular?}

The servo control library used for this project allows motor targets to be set with single-degree precision, so a single motor can take integer positions in $M = \{1\degree, 2\degree, \ldots, 180\degree\}$. The LED can be either on or off. Thus the state space is the set $M \times M \times \{0,1\}$, which has cardinality 64800. At a given time step, the agent may choose to keep a joint fixed, move it left, or move it right, and it may activate or deactivate the LED. Assuming we restrict the agent to movements of unit magnitude, this means there are 18 actions.
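
Written out, the counts are
\[ |S| = |M| \cdot |M| \cdot 2 = 180 \cdot 180 \cdot 2 = 64800, \qquad |A| = 3 \cdot 3 \cdot 2 = 18, \]
since each joint independently holds, moves left, or moves right, and the LED is commanded either on or off.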

A state-action value table based on this representation, assuming 4-byte floats, would occupy more than 4 megabytes of memory. Due to the spread of the LED's light, it may be feasible to reduce the fidelity of the joint state representation and still achieve good performance, and because the optimal policy will likely always elect to move a joint, we may be able to remove the actions which do not move a joint with little adverse effect. Even then, the microcontroller could only theoretically fit less than 10\% of all state-action pairs (without careful optimization, quite a bit less, as the stack needs to live in memory as well).
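
As a quick check of that figure:
\[ 64800 \text{ states} \times 18 \text{ actions} \times 4 \text{ bytes} \approx 4.7 \text{ MB}, \]
more than two thousand times the 2 KB of SRAM available on the ATmega328p.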

-\subsubsection{Function approximation}
+\subsubsection{Function Approximation}

-Though it may initially seem like the microcontroller could support a reasonable number of features, inspection reveals this is not the case. Consider the episodic semi-gradient one-step Sarsa algorithm.
+The microcontroller cannot support as many features as one might hope. Consider the episodic semi-gradient one-step Sarsa algorithm.

\begin{equation}\label{eqn:update}
\bm{\theta}_{t+1} = \bm{\theta}_t + \alpha \Big[R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \bm{\theta}_t) - \hat{q}(S_t, A_t, \bm{\theta}_t)\Big]\nabla\hat{q}(S_t, A_t, \bm{\theta}_t)\tag{1}
\end{equation}

-As shown in the implementation, it is possible to implement the update using only $n$ additional space, where $n$ is the number of weights. While this implementation is not difficult, it is easy to do incorrectly. For instance, if the action selection step is not placed before the memory allocation, the implementation will actually consume $2n$ stack memory; maximizing the value function over possible next states requires an additional $n$ stack space.
+As shown in the implementation, it is possible to implement the update using only $n$ additional floats of memory, where $n$ is the number of weights, but it is easy to get this wrong. For instance, if the action selection step is not placed before the memory allocation, the implementation will actually consume $2n$ floats of stack; maximizing the value function over the possible next actions requires an additional $n$ floats.

\begin{algorithm}
-\caption{Memory-conservative Episodic Semi-gradient One-step Sarsa}
+\caption{Memory-conscious Episodic Semi-gradient One-step Sarsa}
\label{alg:update}
\begin{algorithmic}[1] % The number tells where the line numbering should start
\Procedure{Update}{$S_t$, $A_t$, $S_{t+1}$, $\theta$}
@@ -114,18 +114,20 @@

Using approximation allows us to provide reasonable values across the breadth of the value function, but, because of the memory required by the update and maximization steps, the estimates are less accurate than one might initially expect.
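
A minimal C++ sketch of the memory-conscious update is given below. The identifiers, type definitions, and weight count are illustrative rather than the project's actual code, and the feature and action-selection helpers are only declared.

\begin{verbatim}
#include <stdint.h>

struct State  { uint8_t base; uint8_t elbow; bool led; }; // joint angles, LED
struct Action { int8_t dBase; int8_t dElbow; bool led; };  // -1/0/+1, LED cmd

const uint8_t N_WEIGHTS = 131;   // assumed feature-vector length
float theta[N_WEIGHTS];          // weights live in static RAM, not the stack

void features(const State& s, const Action& a, float* x); // fills n entries
Action selectAction(const State& s);                      // eps-greedy argmax

// q(s,a) = theta . x(s,a); the n-float buffer is released on return.
float qhat(const State& s, const Action& a) {
  float x[N_WEIGHTS];
  features(s, a, x);
  float q = 0.0f;
  for (uint8_t i = 0; i < N_WEIGHTS; ++i) q += theta[i] * x[i];
  return q;
}

// Returns the action A' chosen for S_{t+1}; the caller executes it next so
// the update stays on-policy.
Action update(const State& s, const Action& a, float reward,
              const State& sNext, bool terminal,
              float alpha, float gamma) {
  Action aNext = {0, 0, false};
  float target = reward;
  if (!terminal) {
    // 1. Choose A' first: the argmax inside selectAction() needs n floats
    //    of stack, but that space is freed before step 2 allocates its own.
    aNext = selectAction(sNext);
    target += gamma * qhat(sNext, aNext);
  }

  // 2. Only now allocate the gradient buffer, so peak stack usage stays
  //    near n floats instead of 2n. For a linear approximator the gradient
  //    is simply the feature vector x(S_t, A_t).
  float x[N_WEIGHTS];
  features(s, a, x);
  float q = 0.0f;
  for (uint8_t i = 0; i < N_WEIGHTS; ++i) q += theta[i] * x[i];

  float tdError = target - q;
  for (uint8_t i = 0; i < N_WEIGHTS; ++i) theta[i] += alpha * tdError * x[i];
  return aNext;
}
\end{verbatim}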

-I have not dwelt on time efficiency as 16MHz is a fair amount of computation, even with software floating point operations. For this project, I was satisfied as long as actions could be selected more quickly than the motors could execute them. Meeting this deadline, which was about 100ms, or 1.6 million cycles, was not an issue, even while the device was also streaming logging information over serial.\footnote{If time performance requirements were tighter, special attention would need to be paid to the very expensive process of action selection, which involves $|A|$ value function queries, each costing $n$ multiplications.}
+I have not dwelt on time efficiency, as 16 MHz affords a fair amount of computation, even with software floating point operations. For this project, I was satisfied as long as actions could be selected more quickly than the motors could execute them. Meeting this deadline, which was about 100 ms, or 1.6 million cycles, was not an issue, even while the device was also streaming logging information over serial.\footnote{If timing requirements were tighter, special attention would need to be paid to action selection, which involves $|A|$ value function queries, each costing $n$ multiplications.}
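
For a rough sense of scale: with $n$ on the order of one hundred weights (an illustrative figure, not the project's exact count), a greedy selection costs about $18 \times 100 \approx 2000$ floating-point multiply-accumulates. Assuming a software multiply-accumulate costs a few hundred cycles, selection stays inside the 1.6-million-cycle window while still dominating the cost of a step.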

%----------------------------------------------------------------------------------------
% SECTION 3
%----------------------------------------------------------------------------------------


-\section{Experimental setup}
+\section{Experimental Setup}
+
+The agent learns for 50 steps, then operates in an evaluation mode for an additional 50 steps. $\gamma$ is 0.99 and $\alpha$ is 0.3. During episodes, a small delay is used between action execution and sensing to allow the arm to settle. The photocell threshold is calibrated before every session to ensure that no spurious rewards are granted.
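
One plausible shape for the session constants and the threshold calibration is sketched below; the names, pin assignment, sample count, and margin are assumptions for illustration, not values taken from the project.

\begin{verbatim}
#include <Arduino.h>

const float    GAMMA       = 0.99f;  // discount factor
const float    ALPHA       = 0.3f;   // step size
const uint16_t LEARN_STEPS = 50;     // learning phase
const uint16_t EVAL_STEPS  = 50;     // evaluation phase
const uint8_t  PHOTO_PIN   = A0;     // assumed photocell pin
const int      MARGIN      = 50;     // assumed headroom above ambient light

// Sample ambient light with the LED off and place the reward threshold just
// above the largest reading, so stray light cannot grant a spurious reward.
int calibrateThreshold() {
  int maxReading = 0;
  for (uint8_t i = 0; i < 32; ++i) {
    int r = analogRead(PHOTO_PIN);   // 10-bit reading, 0..1023
    if (r > maxReading) maxReading = r;
    delay(10);
  }
  return maxReading + MARGIN;
}
\end{verbatim}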

\subsection{Features}

-***
+The value function approximator uses a mix of a coarse state representation and custom action features. The range of each joint is divided into 20$\degree$ sections and gridded, resulting in 64 features. Each of these features is further split by whether or not the LED is on in the state. The three action features characterize the direction of each joint's movement and whether or not the LED is activated.
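
One way such a feature vector might be assembled is sketched below; the bin count, layout, and names are assumptions for illustration and may differ from the project's actual feature code.

\begin{verbatim}
#include <stdint.h>

const uint8_t  BINS_PER_JOINT = 8;   // assumed; gives an 8 x 8 = 64-cell grid
const uint16_t N_STATE    = BINS_PER_JOINT * BINS_PER_JOINT * 2; // x2: LED on/off
const uint16_t N_FEATURES = N_STATE + 3;                         // + action features

struct State  { uint8_t base; uint8_t elbow; bool led; }; // joint angles 1..180
struct Action { int8_t dBase; int8_t dElbow; bool led; }; // -1/0/+1, LED command

void features(const State& s, const Action& a, float* x) {
  for (uint16_t i = 0; i < N_FEATURES; ++i) x[i] = 0.0f;

  // One active grid cell for the joint configuration and LED state.
  uint8_t b0 = (uint16_t)(s.base  - 1) * BINS_PER_JOINT / 180;  // 0..BINS-1
  uint8_t b1 = (uint16_t)(s.elbow - 1) * BINS_PER_JOINT / 180;
  uint16_t cell = (s.led ? N_STATE / 2 : 0) + b0 * BINS_PER_JOINT + b1;
  x[cell] = 1.0f;

  // Action features: movement direction of each joint and the LED command.
  x[N_STATE + 0] = (float)a.dBase;
  x[N_STATE + 1] = (float)a.dElbow;
  x[N_STATE + 2] = a.led ? 1.0f : 0.0f;
}
\end{verbatim}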

%----------------------------------------------------------------------------------------
% SECTION 4
@@ -135,23 +137,48 @@
\section{Results}


-***
-
\begin{figure}[h]
\begin{center}
\includegraphics[width=\textwidth]{figure_0.pdf}
\caption{***}
\end{center}
\end{figure}
+
+\subsection{Discussion}

+The agent learns and generalizes a fairly good policy within its first fifty steps. The evaluation period demonstrates that the policy performs well from arbitrary start positions.
+
+As can be seen from the video, the agent's policy is not optimal. Since it lacks features that describe the interaction of the joint position with the value of turning the LED on, the agent activates the LED unnecessarily. Compared to joint movement, correct LED activation is a less important problem, so it seems fair to trade a finer gridding of the joint positions for features that would better capture the use of the LED.


%----------------------------------------------------------------------------------------
-% SECTION 3
+% SECTION 5
%----------------------------------------------------------------------------------------

\section{Conclusions}

-***
+I have demonstrated that it is possible to implement an embedded agent that achieves good performance on an arm control task.
+
+\section{Appendix: Bill of Materials}
+
+
+\begin{center}
+\begin{tabular}{ l r r p{5cm} }
+\hline
+Component & Quantity & Unit Price (\$) & Note \\ \hline
+Arduino Pro Mini, 5V & 1 & 3.00 & \\
+Breadboard & 1 & 4.00 & \\
+SG90 servo & 2 & 3.00 & \\
+Dowel rods & 3 & 2.00 & \\
+1500uF capacitor & 2 & 1.00 & \\
+LED & 1 & 0.10 & The brighter the better.\\
+Photocell & 3 & 0.10 & \\
+5V 2.5A power supply & 1 & 8.00 & If variable supply unavailable. \\
+Assorted jumpers & & 3.00 & \\
+Adhesives, project surface & & 4.00 & \\
+\hline
+\end{tabular}
+\end{center}
+

\end{document}
