% !TEX root = thesis.tex
\documentclass[thesis]{subfiles}
\begin{document}
%*******************************************************************************
%*********************************** Epilogue *****************************
%*******************************************************************************
\chapter{Bibliographic Epilogue}
\label{epilogue}
%********************************** %First Section **************************************
Research in deep learning has accelerated markedly with the adoption of pre-prints and the explosion of interest in the field. Rather than being bound to the annual conference schedule, new research is now released on a weekly basis. When presenting a paper at a contemporary conference, one is in the odd position of having to relate the `new' research being presented to the 6--12 months of follow-up work it has already prompted. Compare this to even a few years ago, when publication of new research was withheld until a conference paper was accepted, perhaps a couple of months before the conference itself.
This dissertation is the final presentation of the research we have undertaken and, just as at a conference, it must be presented in the context and timeline of the follow-up research and applications it has already inspired.
In this chapter, we briefly outline the significant derivative papers published after the original pre-print publication of the research presented here, along with their pre-print dates. We also outline new research, published after ours, that has moved the field towards learning structural priors automatically.
\section{Pre-print Publication Dates}
For reference, the initial public release of the papers behind the work presented in this dissertation are outlined below:
\begin{itemize}
\item Decision Forests, Convolutional Networks and the Models in-Between~\citep{Ioannou2015}
\begin{description}
\item[MSR internal technical report] Apr.\ 2015
\item[Pre-print] 3 Mar.\ 2016
\end{description}
\item Training \glspl{cnn}\index{CNN} with Low-Rank Filters for Efficient Image Classification~\citep{Ioannou2016}
\begin{description}
\item[Pre-print] arXiv:1511.06744 (30 Nov.\ 2015)
\item[Peer-reviewed publication date] ICLR 2016 (2 May 2016)
\end{description}
\item Deep Roots: Improving \gls{cnn} Efficiency with Hierarchical Filter Groups~\citep{ioannou2016e}
\begin{description}
\item[Pre-print] arXiv:1605.06489 (20 May 2016)
\item[Peer-reviewed publication date] CVPR 2017 (21 Jul.\ 2017)
\end{description}
\end{itemize}
\section{Recent Research Related to \Cref{lowrankfilters}}
The method presented in \cref{lowrankfilters} applies directly to many deep learning vision problems, and we have heard directly from developers of its use on embedded devices, in applications ranging from some of the most popular mobile phones to autonomous drones.
Unfortunately, even citing authors often assume the paper presents a low-rank approximation of the filters, despite our explicit and frequent statements that our models are trained from scratch and not approximated. This is likely due to the title and its similarity to the titles of the large body of preceding literature on approximation methods.
\subsection*{Rethinking the Inception Architecture for Computer Vision}
\begin{description}
\item[Pre-print] arXiv:1512.00567 (2 Dec.\ 2015)
\item[Peer-reviewed publication date] CVPR 2016 (28 Jun.\ 2016)
\end{description}
\Citet{szegedy2015rethinking} published an update to the \Gls{inception} architecture shortly after our pre-print publication, making it likely that their work was independent. Nevertheless, the method they present is identical to our proposal~\citep{Ioannou2016}: training the \Gls{inception} architecture with low-rank (\ie 1$\times$3 and 3$\times$1) filters to reduce computation and improve generalization. They call these `factorized' filters, a term we disagree with: since the filters are simply concatenated, they do not form a factorization (\ie a separation of a multiplication); rather, we argue, they are linearly combined and so represent a basis.
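To make the distinction concrete, the comparison below uses notation chosen purely for illustration (it is not taken from either paper): $W$ is a $3\times3$ kernel, $v$ and $h$ are $3\times1$ and $1\times3$ kernels, $x$ is an input \gls{featuremap}, $b_i$ are the concatenated low-rank filters, and $\alpha_i$ are the weights of a following layer that linearly combines their responses. A factorization separates a multiplication, \ie composes two convolutions (valid when $W = v\,h^{\top}$), whereas concatenating low-rank filters and summing their responses separates a sum, and so learns a basis:
\[
\underbrace{W \ast x = v \ast \left( h \ast x \right)}_{\mbox{\small factorization: a separated multiplication}}
\qquad \mbox{versus} \qquad
\underbrace{y = \sum_{i} \alpha_{i} \left( b_{i} \ast x \right)}_{\mbox{\small basis: a separated sum}}
\]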
This method (proposed independently in \cref{lowrankfilters}) forms the basis of all current \Gls{inception} architectures, and is now used across Google's deep-learning-backed products, for example Google Photos\texttrademark.
\section{Recent Research Related to \Cref{deeproots}}
Unfortunately, likely due to the prominence of Google and Facebook in the research community, the Xception and ResNeXt papers below have received most of the citations and credit for root units in recent work, despite the pre-prints coming five and six months after our pre-print publication, respectively. On the other hand, this exposure has brought a lot of attention to the method, making the use of filter groups, and root modules specifically, for efficiency and generalization increasingly common.
\subsubsection*{Xception: Deep Learning with Depthwise Separable Convolutions}
\begin{description}
\item[Pre-print] arXiv:1610.02357 (7 Oct.\ 2016)
\item[Peer-reviewed publication date] CVPR 2017 (21 Jul.\ 2017)
\end{description}
The author proposes the `Xception' module, a special case of our root modules~\citep{ioannou2016e}: if $c_1$ is the number of input \gls{featuremap} channels and $g$ is the number of groups (as in \cref{fig:rootmodule}), an `Xception' module is the extreme case where $g \equiv c_1$. In our experiments, this level of sparsity was clearly adverse; in the `Xception' experiments the number of filters is greatly increased compared to the original architecture to compensate for this. However, by using as many convolutional groups as input channels, the computational advantages of root modules are lost, for the reasons described in \cref{gpuexplanation}. We should note that, in personal correspondence, the author denied any relationship between this work and ours, and does not cite our work.
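To make the relationship concrete, the following is a minimal PyTorch-style sketch written for this discussion (it is not code from either work, and the channel counts are arbitrary), showing a root-module-style convolution with $g$ filter groups alongside the `Xception'/depthwise extreme, in which the number of groups equals the number of input channels.
\begin{lstlisting}[language=Python]
import torch
import torch.nn as nn

# Illustrative channel counts only (not those used in the experiments).
c1, c2, g = 256, 256, 4

# Root-module-style grouped spatial convolution: the c2 filters are split
# into g groups, each applied to only c1/g of the input channels.
root_conv = nn.Conv2d(c1, c2, kernel_size=3, padding=1, groups=g)

# The `Xception'/depthwise extreme: g == c1, so each filter sees one channel.
depthwise_conv = nn.Conv2d(c1, c2, kernel_size=3, padding=1, groups=c1)

x = torch.randn(1, c1, 32, 32)
print(sum(p.numel() for p in root_conv.parameters()))       # ~147k weights
print(sum(p.numel() for p in depthwise_conv.parameters()))  # ~2.6k weights
\end{lstlisting}
On paper the depthwise extreme uses far fewer weights, but as noted above this does not translate into the practical computational advantages of root modules.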
% \lstset{
% basicstyle=\small\ttfamily,
% columns=flexible,
% breaklines=true
% }
% \begin{lstlisting}
% Hi Yani,
% A few big technical differences are:
% - separable convolution = depthwise convolution + pointwise convolution. There is no notion of pointwise convolution in filter groups\index{filter groups}.
% - depthwise convolution works depthwise and doesn't involve the notion of groups of filters.
% - filter groups\index{filter groups} involve 2-4 different kernels, while depthwise convolution involves as many kernels as input channels. In that sense it would be more accurate to relate filter groups\index{filter groups} to Inception modules.
% - the motivations are different, with separable convolution aiming at factorizing convolution kernels while filter groups\index{filter groups} aim at distributing workload on different GPUs.
% All in all, the relationship seems extremely tenuous. I asked Laurent Sifre about it, and it seems he was not inspired by filter groups\index{filter groups} at all during his work on developing separable convolutions. Depthwise separable convolutions are actually much more closely related to spatially separable convolutions, which have been in use for decades in the image processing community. Laurent's thesis has some pretty interesting insight about it.
% Cheers,
% Francois
% \end{lstlisting}
% I have pointed out that our root modules also used 1x1 convolutions after the filter groups\index{filter groups} in our resnet experiments, similar to what the author names `separable convolution'. Following adjacent (and independent) filter groups\index{filter groups}/`depthwise' convolution by a 1$\times$1 convolution, the network is not learning a `separable' (\ie factorized) convolution --- we are not separating a multiplication, you are separating a sum. This (if you are generous in ignoring the non-linearities) is in fact learning a basis - the 1$\times$1 convolution is linearly combining multiple convolutions.
% Unfortunately no further correspondence or acknowledgement was received, and the author does not cite our work.
\subsubsection*{Aggregated Residual Transformations for Deep Neural Networks}
\begin{description}
\item[Pre-print] arXiv:1611.05431 (16 Nov.\ 2016)
\item[Peer-reviewed publication date] CVPR 2017 (21 Jul.\ 2017)
\end{description}
The aggregated residual units proposed by \citet{saining2017} are technically identical to our root modules as implemented in ResNet (the paper cites our work); however, they explore a different compute-generalization trade-off with their ResNeXt architecture.
Rather than using the more efficient representation learned by the root units to save parameters and computation as in our work, the authors propose to maintain the original computational footprint of the model, and instead increase the number of filters learned. The result is a much improved network, so much so that their model won second place in the 2016 \gls{ilsvrc} competition.
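As a back-of-the-envelope illustration of this trade-off (in the notation of \cref{fig:rootmodule}, with $c_2$ filters of spatial size $k\times k$; the $\sqrt{g}$ scaling is a simplification for illustration rather than the exact widths used by ResNeXt), a grouped convolutional layer has $c_1 c_2 k^2/g$ weights. Our root modules spend this saving directly, whereas widening both channel counts by roughly $\sqrt{g}$ returns to the original cost while learning more filters:
\[
\underbrace{\frac{c_1\,c_2\,k^2}{g}}_{\mbox{\small root module: keep the saving}}
\qquad \mbox{versus} \qquad
\frac{\left(\sqrt{g}\,c_1\right)\left(\sqrt{g}\,c_2\right)k^2}{g} \;=\; \underbrace{c_1\,c_2\,k^2}_{\mbox{\small ResNeXt-style: reinvest it in width}}
\]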
Interestingly, the authors claim that root units are even more effective on much larger datasets, and more effective than further increasing the depth or width of the network: \citet{saining2017} state that ``\ldots increasing cardinality is more effective than going deeper or wider when we increase the capacity'', where cardinality denotes the number of filter groups\index{filter groups} used.
\subsubsection*{ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices}
\begin{description}
\item[Pre-print] arXiv:1707.01083 (4 Jul.\ 2017)
\end{description}
\citet{zhang2017shufflenet} combine filter groups\index{filter groups} in the 1$\times$1 (pointwise) convolutions with a channel shuffle operation that permutes channels between successive grouped layers, allowing information to flow between groups. As with root modules, the motivation is efficiency, here targeted at mobile-scale computational budgets.
\subsubsection*{Interleaved Group Convolutions for Deep Neural Networks}
\begin{description}
\item[Pre-print] arXiv:1707.02725 (Jul.\ 2017)
\item[Peer-reviewed publication date] ICCV 2017 (22 Oct.\ 2017)
\end{description}
\citet{zhang2017primal} compose two successive grouped convolutions whose channel partitions are interleaved, so that each output channel depends on every input channel while both layers remain sparse, exploring another point on the sparsity--generalization trade-off examined with root modules.
\subsubsection*{The Power of Sparsity in Convolutional Neural Networks}
\begin{description}
\item[Pre-print] arXiv:1702.06257 (21 Feb.\ 2017)
\end{description}
\citet{changpinyo2017power} study sparsifying the connections between \glspl{featuremap} in convolutional layers, of which filter groups\index{filter groups} are a structured special case, and find that heavily sparsified networks can retain much of the accuracy of their dense counterparts.
\subsubsection*{Convolution with Logarithmic Filter Groups for Efficient Shallow {CNN}}
\begin{description}
\item[Pre-print] arXiv:1707.09855 (31 Jul.\ 2017)
\end{description}
\citet{Lee2017} propose filter groups\index{filter groups} of unequal, logarithmically distributed sizes, rather than the uniformly sized groups used in root modules, targeting efficient shallow networks for mobile applications.
\section{Recent Research Related to \Cref{conditionalnetworks}}
Although conditional computation remains a very interesting topic, little has been published in relation to this paper since its pre-print release, aside from the work of \citet{BengioE2015}, who used a reinforcement learning framework to add conditional computation to a \gls{dnn}.
\subsubsection*{Conditional Computation in Neural Networks for Faster Models}
\begin{description}
\item[Pre-print] arXiv:1511.06297 (19 Nov.\ 2015)
\end{description}
\citet{BengioE2015} use policy gradient methods to learn input-dependent policies that switch blocks of hidden units on or off, trading a small loss in accuracy for reduced computation.
% http://trace.tennessee.edu/cgi/viewcontent.cgi?article=5323&context=utk_graddissf
\section[Automatically Learning Network Architectures]{Automatically Learning Network\texorpdfstring{\\}{ }Architectures}
Since the papers this dissertation is based on were published, the research community has been moving towards learning neural network architectures automatically. In particular, the following papers have shown significant progress.
%\citet{andrychowicz2016learning}
%\subsection{Neural Architecture Search with Reinforcement Learning}
%\subsection{Designing Neural Network Architectures using Reinforcement Learning}
\citet{baker2017, zoph2017} both presented methods of learning neural network architectures with reinforcement learning: \citet{baker2017} use Q-learning to sequentially choose the layers of a network, while \citet{zoph2017} train a recurrent controller with policy gradients to generate architecture descriptions. Both approaches learn structural priors of the kind that have, until now, been hand-designed.
\end{document}