Commit 8e8f137 (2 parents: af82506 + 0fe650c)

1 file changed

report/report.tex: 41 additions & 14 deletions
@@ -54,15 +54,15 @@

\section{Introduction}

-\subsection{Learning platform}
+\subsection{Learning Platform}

I based the platform on the Atmel ATmega328p, a cheap and widely available 8-bit microcontroller. It has 32 KB of program memory and 2 KB of SRAM, and operates at 16 MHz. This chip is used by popular hobbyist electronics boards such as the Arduino Uno. For this project, I used an Arduino Pro Mini development board. It does not have a floating point unit, though the Arduino runtime provides a software float implementation.

In order to construct a task representative of the complexity an embedded agent might face, I decided to build a two-DOF arm with an LED actuator. The arm's joints are SG90 micro-servos, each capable of 180$\degree$ of rotation. The base joint is fixed to a surface, and the elbow joint is mounted to the base with 3-inch connecting dowels. The elbow joint controls another rod, which is tipped with an LED. The motors are connected such that the middles of their rotation ranges align.

\subsection{Problem}

-The agent must point the LED at a photo cell fixed to the surface in as few movements as possible, activating the LED as little as possible. The episode ends when the photo cell reads above a defined threshold.
+The agent must point the LED at a photocell fixed to the surface in as few movements as possible, activating the LED as little as possible. It begins from a random initial configuration. The episode ends when the photocell reads above a defined threshold.

\[ r(s,a,s') = \left\{
\begin{array}{ll}
@@ -73,26 +73,26 @@
\right. \]


-\subsection{Learning approach}
+\subsection{Learning Approach}

\subsubsection{Tabular?}

The servo control library used for this project allows motor targets to be set with single-degree precision, so a single motor can take integer positions in $M = \{1\degree, 2\degree, \ldots, 180\degree\}$. The LED can be either on or off. Thus the state space is the set $M \times M \times \{0,1\}$, which has cardinality 64800. At a given time step, the agent may choose to keep a joint fixed, move it left, or move it right, and it may activate or deactivate the LED. Assuming we restrict the agent to movements of unit magnitude, this means there are 18 actions.
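
Written out, the counts are
\[ |S| = |M| \cdot |M| \cdot 2 = 180 \cdot 180 \cdot 2 = 64800, \qquad |A| = 3 \cdot 3 \cdot 2 = 18, \]
since each joint independently holds, moves left, or moves right, and the LED is commanded either on or off.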

A state-action value table based on this representation, assuming 4-byte floats, would occupy more than 4 megabytes of memory. Due to the spread of the LED's light, it may be feasible to reduce the fidelity of the joint state representation and still achieve good performance, and because the optimal policy will likely always elect to move a joint, we may be able to remove the actions which do not move a joint with little adverse effect. Even then, the microcontroller could only theoretically fit less than 10\% of all state-action pairs (without careful optimization, quite a bit less, as the stack needs to live in memory as well).
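
As a quick check of that figure:
\[ 64800 \text{ states} \times 18 \text{ actions} \times 4 \text{ bytes} \approx 4.7 \text{ MB}, \]
more than two thousand times the 2 KB of SRAM available on the ATmega328p.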

-\subsubsection{Function approximation}
+\subsubsection{Function Approximation}

-Though it may initially seem like the microcontroller could support a reasonable number of features, inspection reveals this is not the case. Consider the episodic semi-gradient one-step Sarsa algorithm.
+The microcontroller cannot support as many features as one might hope. Consider the episodic semi-gradient one-step Sarsa algorithm.

\begin{equation}\label{eqn:update}
\bm{\theta}_{t+1} = \bm{\theta}_t + \alpha \Big[R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \bm{\theta}_t) - \hat{q}(S_t, A_t, \bm{\theta}_t)\Big]\nabla\hat{q}(S_t, A_t, \bm{\theta}_t)\tag{1}
\end{equation}

-As shown in the implementation, it is possible to implement the update using only $n$ additional space, where $n$ is the number of weights. While this implementation is not difficult, it is easy to do incorrectly. For instance, if the action selection step is not placed before the memory allocation, the implementation will actually consume $2n$ stack memory; maximizing the value function over possible next states requires an additional $n$ stack space.
+As shown in the implementation, it is possible to implement the update using only $n$ additional floats of memory, where $n$ is the number of weights, but it is easy to get this wrong. For instance, if the action selection step is not placed before the memory allocation, the implementation will actually consume $2n$ floats of stack; maximizing the value function over the possible next actions requires an additional $n$ floats.

\begin{algorithm}
-\caption{Memory-conservative Episodic Semi-gradient One-step Sarsa}
+\caption{Memory-conscious Episodic Semi-gradient One-step Sarsa}
\label{alg:update}
\begin{algorithmic}[1] % The number tells where the line numbering should start
\Procedure{Update}{$S_t$, $A_t$, $S_{t+1}$, $\theta$}
@@ -114,18 +114,20 @@

Using approximation allows us to provide reasonable values across the breadth of the value function, but, because of the memory required by the update and maximization steps, the estimates are less accurate than one might initially expect.
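
A minimal C++ sketch of the memory-conscious update is given below. The identifiers, type definitions, and weight count are illustrative rather than the project's actual code, and the feature and action-selection helpers are only declared.

\begin{verbatim}
#include <stdint.h>

struct State  { uint8_t base; uint8_t elbow; bool led; }; // joint angles, LED
struct Action { int8_t dBase; int8_t dElbow; bool led; };  // -1/0/+1, LED cmd

const uint8_t N_WEIGHTS = 131;   // assumed feature-vector length
float theta[N_WEIGHTS];          // weights live in static RAM, not the stack

void features(const State& s, const Action& a, float* x); // fills n entries
Action selectAction(const State& s);                      // eps-greedy argmax

// q(s,a) = theta . x(s,a); the n-float buffer is released on return.
float qhat(const State& s, const Action& a) {
  float x[N_WEIGHTS];
  features(s, a, x);
  float q = 0.0f;
  for (uint8_t i = 0; i < N_WEIGHTS; ++i) q += theta[i] * x[i];
  return q;
}

// Returns the action A' chosen for S_{t+1}; the caller executes it next so
// the update stays on-policy.
Action update(const State& s, const Action& a, float reward,
              const State& sNext, bool terminal,
              float alpha, float gamma) {
  Action aNext = {0, 0, false};
  float target = reward;
  if (!terminal) {
    // 1. Choose A' first: the argmax inside selectAction() needs n floats
    //    of stack, but that space is freed before step 2 allocates its own.
    aNext = selectAction(sNext);
    target += gamma * qhat(sNext, aNext);
  }

  // 2. Only now allocate the gradient buffer, so peak stack usage stays
  //    near n floats instead of 2n. For a linear approximator the gradient
  //    is simply the feature vector x(S_t, A_t).
  float x[N_WEIGHTS];
  features(s, a, x);
  float q = 0.0f;
  for (uint8_t i = 0; i < N_WEIGHTS; ++i) q += theta[i] * x[i];

  float tdError = target - q;
  for (uint8_t i = 0; i < N_WEIGHTS; ++i) theta[i] += alpha * tdError * x[i];
  return aNext;
}
\end{verbatim}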

-I have not dwelt on time efficiency as 16MHz is a fair amount of computation, even with software floating point operations. For this project, I was satisfied as long as actions could be selected more quickly than the motors could execute them. Meeting this deadline, which was about 100ms, or 1.6 million cycles, was not an issue, even while the device was also streaming logging information over serial.\footnote{If time performance requirements were tighter, special attention would need to be paid to the very expensive process of action selection, which involves $|A|$ value function queries, each costing $n$ multiplications.}
+I have not dwelt on time efficiency, as 16 MHz affords a fair amount of computation, even with software floating point operations. For this project, I was satisfied as long as actions could be selected more quickly than the motors could execute them. Meeting this deadline, which was about 100 ms, or 1.6 million cycles, was not an issue, even while the device was also streaming logging information over serial.\footnote{If timing requirements were tighter, special attention would need to be paid to action selection, which involves $|A|$ value function queries, each costing $n$ multiplications.}
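
For a rough sense of scale: with $n$ on the order of one hundred weights (an illustrative figure, not the project's exact count), a greedy selection costs about $18 \times 100 \approx 2000$ floating-point multiply-accumulates. Assuming a software multiply-accumulate costs a few hundred cycles, selection stays inside the 1.6-million-cycle window while still dominating the cost of a step.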

%----------------------------------------------------------------------------------------
% SECTION 3
%----------------------------------------------------------------------------------------


-\section{Experimental setup}
+\section{Experimental Setup}
+
+The agent learns for 50 steps, then operates in an evaluation mode for an additional 50 steps. $\gamma$ is 0.99 and $\alpha$ is 0.3. During episodes, a small delay is used between action execution and sensing to allow the arm to settle. The photocell threshold is calibrated before every session to ensure that no spurious rewards are granted.
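
One plausible shape for the session constants and the threshold calibration is sketched below; the names, pin assignment, sample count, and margin are assumptions for illustration, not values taken from the project.

\begin{verbatim}
#include <Arduino.h>

const float    GAMMA       = 0.99f;  // discount factor
const float    ALPHA       = 0.3f;   // step size
const uint16_t LEARN_STEPS = 50;     // learning phase
const uint16_t EVAL_STEPS  = 50;     // evaluation phase
const uint8_t  PHOTO_PIN   = A0;     // assumed photocell pin
const int      MARGIN      = 50;     // assumed headroom above ambient light

// Sample ambient light with the LED off and place the reward threshold just
// above the largest reading, so stray light cannot grant a spurious reward.
int calibrateThreshold() {
  int maxReading = 0;
  for (uint8_t i = 0; i < 32; ++i) {
    int r = analogRead(PHOTO_PIN);   // 10-bit reading, 0..1023
    if (r > maxReading) maxReading = r;
    delay(10);
  }
  return maxReading + MARGIN;
}
\end{verbatim}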

\subsection{Features}

-***
+The value function approximator uses a mix of a coarse state representation and custom action features. The range of each joint is divided into 20$\degree$ sections and gridded, resulting in 64 features. Each of these features is further split by whether or not the LED is on in the state. The three action features characterize the direction of each joint's movement and whether or not the LED is activated.
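
One way such a feature vector might be assembled is sketched below; the bin count, layout, and names are assumptions for illustration and may differ from the project's actual feature code.

\begin{verbatim}
#include <stdint.h>

const uint8_t  BINS_PER_JOINT = 8;   // assumed; gives an 8 x 8 = 64-cell grid
const uint16_t N_STATE    = BINS_PER_JOINT * BINS_PER_JOINT * 2; // x2: LED on/off
const uint16_t N_FEATURES = N_STATE + 3;                         // + action features

struct State  { uint8_t base; uint8_t elbow; bool led; }; // joint angles 1..180
struct Action { int8_t dBase; int8_t dElbow; bool led; }; // -1/0/+1, LED command

void features(const State& s, const Action& a, float* x) {
  for (uint16_t i = 0; i < N_FEATURES; ++i) x[i] = 0.0f;

  // One active grid cell for the joint configuration and LED state.
  uint8_t b0 = (uint16_t)(s.base  - 1) * BINS_PER_JOINT / 180;  // 0..BINS-1
  uint8_t b1 = (uint16_t)(s.elbow - 1) * BINS_PER_JOINT / 180;
  uint16_t cell = (s.led ? N_STATE / 2 : 0) + b0 * BINS_PER_JOINT + b1;
  x[cell] = 1.0f;

  // Action features: movement direction of each joint and the LED command.
  x[N_STATE + 0] = (float)a.dBase;
  x[N_STATE + 1] = (float)a.dElbow;
  x[N_STATE + 2] = a.led ? 1.0f : 0.0f;
}
\end{verbatim}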

%----------------------------------------------------------------------------------------
% SECTION 4
@@ -135,23 +137,48 @@
\section{Results}


-***
-
\begin{figure}[h]
\begin{center}
\includegraphics[width=\textwidth]{figure_0.pdf}
\caption{***}
\end{center}
\end{figure}
+
+\subsection{Discussion}

+The agent learns and generalizes a fairly good policy within its first fifty steps. The evaluation period demonstrates that the policy performs well from arbitrary start positions.
+
+As can be seen from the video, the agent's policy is not optimal. Since it lacks features that describe the interaction of the joint position with the value of turning the LED on, the agent activates the LED unnecessarily. Compared to joint movement, correct LED activation is a less important problem, so it seems fair to trade a finer gridding of the joint positions for features that would better capture the use of the LED.


%----------------------------------------------------------------------------------------
-% SECTION 3
+% SECTION 5
%----------------------------------------------------------------------------------------

\section{Conclusions}

-***
+I have demonstrated that it is possible to implement an embedded agent that achieves good performance on an arm control task.
+
+\section{Appendix: Bill of Materials}
+
+
+\begin{center}
+\begin{tabular}{ l r r p{5cm} }
+\hline
+Component & Quantity & Unit Price (\$) & Note \\ \hline
+Arduino Pro Mini, 5V & 1 & 3.00 & \\
+Breadboard & 1 & 4.00 & \\
+SG90 servo & 2 & 3.00 & \\
+Dowel rods & 3 & 2.00 & \\
+1500uF capacitor & 2 & 1.00 & \\
+LED & 1 & 0.10 & The brighter the better.\\
+Photocell & 3 & 0.10 & \\
+5V 2.5A power supply & 1 & 8.00 & If variable supply unavailable. \\
+Assorted jumpers & & 3.00 & \\
+Adhesives, project surface & & 4.00 & \\
+\hline
+\end{tabular}
+\end{center}
+

\end{document}
