\chapter{RELATED WORK} \label{chapter:related}
\noindent This chapter will review the published literature relating to the methods and algorithms presented in this dissertation. This work has three main intersections with previous research: 1) machine learning, deep learning, and neural networks, 2) computer vision applications, datasets, and techniques for image classification, object detection, and segmentation, and 3) animal re-identification for large-scale population monitoring. The work done by the computer vision field is vast, and animal applications represent a small (but growing) segment. In addition, there has been an increased number of papers and interest in cross-applying advanced computer vision algorithms on animals. This interest has grown enough to support new workshops at premier computer vision conferences like ICPR, AAAI, WACV, and CVPR under the general topics of ``Computer Vision for Social Good'' or simply ``Computer Vision for Animals''. The research presented in the following chapters fits well into these themes, and, hopefully, the state-of-the-art in automated wildlife monitoring will continue to be pursued and advanced.\blfootnote{Portions of this chapter previously appeared as: J. Parham and C. Stewart, ``Detecting plains and Grevy’s zebras in the real world,'' in \textit{IEEE Winter Conf. Applicat. Comput. Vis. Workshops}, Lake Placid, NY, USA, Mar. 2016, pp. 1–9.}\blfootnote{Portions of this chapter previously appeared as: J. Parham, J. Crall, C. Stewart, T. Berger-Wolf, and D. I. Rubenstein, ``Animal population censusing at scale with citizen science and photographic identification,'' in \textit{AAAI Spring Symp.}, Palo Alto, CA, USA, Jan. 2017, pp. 37–44.}\blfootnote{Portions of this chapter previously appeared as: J. Parham \textit{et al.}, ``An animal detection pipeline for identification,'' in \textit{IEEE Winter Conf. Applicat. Comput. Vis.}, Lake Tahoe, CA, USA, Mar. 2018, pp. 1–9.}
\section{Deep Learning \& Image Classification}
The domain of computer vision was thrust to the forefront of publicly known computer science applications with the rise of machine learning and, specifically, deep learning and neural networks~\cite{poultney_efficient_2006,hinton_fast_2006,marlin_inductive_2010,salakhutdinov_deep_2009}. Neural networks excelled at solving classic computer vision problems like image classification~\cite{krizhevsky_imagenet_2012,farabet_learning_2013,sermanet_overfeat:_2013, simonyan_very_2014, springenberg_striving_2014}, bounding box localization~\cite{sermanet_overfeat:_2013,erhan_scalable_2014, girshick_fast_2015,szegedy_going_2015}, and object detection~\cite{he_deep_2015,long_fully_2015,redmon_you_2016,ren_faster_2015} due to their ability to learn complex representations from supervised training data. The work presented here relies heavily on the advancements in neural network design and improvements in training procedures.
One of the tremendous technological advances of the deep learning era in computer vision has been the ability to learn how to represent an image with a feature extractor~\cite{bergstra_quadratic_2009,sharif_razavian_cnn_2014}. Furthermore, the ability to train a neural network end-to-end that can learn an objective (e.g., object classification) directly from pixels has been a transformative force within the domain. Therefore, it is essential to review a brief history of neural networks and their impact on the computer vision discipline. The following discussion sets the context for the deep learning methods used throughout this dissertation. In addition, it gives a chronological overview of when current machine learning techniques were introduced and why they are still used in modern applications.
\subsection{AlexNet \& Overfeat}
AlexNet~\cite{krizhevsky_imagenet_2012} was the original network that, in 2012, broke the mold of using hand-engineered features for computer vision tasks. The name ``AlexNet'' is a callback to ``LeNet'' by LeCun \textit{et al.}~\cite{lecun_comparison_1995,lecun_gradient-based_1998}, which was designed to perform handwritten digit classification~\cite{simard_best_2003} for the U.S. Postal Service in the 1990s. The approach used by AlexNet achieved the lowest error for the classification and localization tasks in the widely popular ILSVRC~\cite{russakovsky_imagenet_2015} challenge in 2012. Until that point, the majority of computer vision applications~\cite{bertozzi_pedestrian_2007,dalal_histograms_2005,zhu_fast_2006} relied on SIFT~\cite{lowe_distinctive_2004}, Deformable Parts Models~\cite{felzenszwalb_discriminatively_2008}, and HOG~\cite{dalal_histograms_2005} for these tasks. The technique of Krizhevsky \textit{et al.} diverged strongly from the traditional thinking of hand-engineered feature extraction. Instead, AlexNet learned how to create high-dimensional representations from images that optimized a global loss function. The AlexNet network was also the first to employ dropout by Hinton \textit{et al.}~\cite{hinton_improving_2012} in a competition setting to better regularize the final model and prevent over-fitting. Dropout is used to train some of the neural networks in this research.
The basic instruction set needed to compute a neural network layer's forward activations and backpropagation loss derivatives is relatively small. Deep learning algorithms often use hardware acceleration on Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs)~\cite{jouppi_-datacenter_2017} to drastically speed up the computation needed for training and inference. Since the computation is relatively simple, it can be naturally parallelized across thousands of smaller, less complex compute cores instead of a handful of general-purpose compute cores like those found in a modern CPU. For example, the use of NVIDIA GPU hardware with CUDA~\cite{nickolls_scalable_2008} drastically reduces the training time of large neural networks, by roughly 1.5 orders of magnitude compared to CPUs~\cite{nickolls_scalable_2008}. The AlexNet network was so novel and massive for its time that existing accelerator hardware sold on the open market was unable to handle its size. The authors engineered around that problem by training the network on two separate GPUs to avoid hitting a hard memory constraint. The work presented in this dissertation uses NVIDIA GPU hardware and CUDA to accelerate all of the training and forward inference.
Unfortunately, the original AlexNet network definition and training procedure were unpublished when they won the ILSVRC challenge. The authors of Overfeat by Sermanet \textit{et al.}~\cite{sermanet_overfeat:_2013} claim a very similar place in computer vision history by replicating this work and being the first to document and publish an implementation of convolutional classification with a multi-layer network. The work of~\cite{zeiler_visualizing_2014} with their ZFNet was mainly based on the AlexNet structure, but with new hyper-parameter tuning techniques, which led them to win the ILSVRC 2013 challenge.
\subsection{VGG}
The runners-up of the 2014 ILSVRC were the creators of the VGG network~\cite{simonyan_very_2014}, which marked a significant improvement in neural network feature extraction over AlexNet and Overfeat. The advantage of the VGG network compared to previous networks was that it was exceedingly deep for its time, at 19 layers compared to the five convolutional layers of its predecessors. In addition, the VGG architecture used smaller 3x3 convolutional layers and 2x2 max-pooling layers throughout the network, simplifying the network's objective significantly and speeding up training.
\subsection{Transfer Learning}
A significant advantage of the VGG network was that it began the first meaningful exploration of transfer learning~\cite{raina_self-taught_2007,oquab_learning_2014,yosinski_how_2014} since the authors had difficulty getting the deeper network to converge. The VGG authors first optimized a smaller convolutional network through ``pre-training'' and transferred the weights to the final network. With its convolutional filters better initialized, the network was then trained in a process called ``fine-tuning'' to create the final model. The benefits of fine-tuning should not be overlooked: the transferred filters are likely trained for a particular distribution and may apply inefficiently to a new dataset. Updating the convolutional weights of a transferred model with fine-tuning often improves overall performance. Transfer learning has also driven a massive exploration in neural network applications by allowing convolutional filters trained on a larger dataset to be applied to smaller applications where there is not enough data to train the networks from scratch. We will see the technique of transfer learning applied in the animal detection pipeline and Census Annotation approaches.
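To make the pre-training and fine-tuning workflow concrete, the following is a minimal sketch that assumes a PyTorch and \texttt{torchvision} setup; the DenseNet-201 backbone mirrors the feature extractor used later in this dissertation, while the two-class head, learning rate, and variable names are purely illustrative.
\begin{verbatim}
# Minimal transfer-learning sketch (assumes PyTorch + torchvision).
import torch
import torch.nn as nn
from torchvision import models

num_classes = 2                                # hypothetical target task
model = models.densenet201(pretrained=True)    # ImageNet "pre-training"
                                               # (newer torchvision uses a
                                               #  `weights` argument instead)

# "Fine-tuning": replace the classifier head and update all weights with a
# small learning rate so the transferred filters adapt to the new dataset.
model.classifier = nn.Linear(model.classifier.in_features, num_classes)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
\end{verbatim}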
\subsection{GoogLeNet \& Inception}
The first-place winner of the ILSVRC 2014 challenge was a Google team with their complex GoogLeNet architecture~\cite{szegedy_going_2015,chollet_xception_2017}. The network had a 6.67\% top-5 error rate, a noticeable improvement compared to the previous year's first-place winning performance of 14.8\%. The key insight of the GoogLeNet architecture was the use of ``inception modules'', which included a collection of multiple 3x3 and 1x1 convolutional filters within a single layer. The use of 1x1 convolutions is a variant of the research by Lin \textit{et al.} and their ``Network in Network'' convolutions, which also had a filter size of 1x1~\cite{lin_network_2013}. The added inception modules effectively allowed the network to be deeper than the VGG network at 22 layers but with significantly fewer convolutional filter weights (roughly 4 million) than the original AlexNet approach (approximately 60 million). It was clear that deeper models generated superior results, but the research and competition communities still had difficulty training deep networks.
\subsection{Optimization Algorithms}
To improve training stability, the GoogLeNet model was trained by replacing the Stochastic Gradient Descent (SGD) optimizer with a different algorithm called RMSProp~\cite{ruder_overview_2016}, which was later combined with AdaGrad~\cite{duchi_adaptive_2011} and published as the ADAM optimizer~\cite{kingma_adam:_2014}. Neural network training had relied until then on various versions of Gradient Descent to optimize the network weights from their initial conditions. In general, a neural network model is initialized with a set of randomized weights (ignoring pre-trained weights) for a given initialization scheme~\cite{mishkin_all_2015,sutskever_importance_2013,bengio_greedy_2007}. An input image (or other data source) is given to the network for its feed-forward inference pass, and it outputs a vector of a pre-defined size. A loss function~\cite{specht_probabilistic_1990} is then used to compute the current error based on the difference between a provided ground-truth label and the network's output. The error loss for the output layer is then used to compute the loss with respect to the penultimate layer's outputs, and the process is repeated recursively for all layers in a procedure called ``back-propagation''~\cite{hecht-nielsen_theory_1989,rumelhart_learning_1986,specht_biased_2018}. The respective loss for each layer in the network is then used to update the current weights to reduce the overall error, representing one update step.
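The loop described above can be summarized with a short sketch; the tiny model, random mini-batch, and hyper-parameters below are stand-ins chosen only to make the forward pass, back-propagation, and weight update concrete (assuming a PyTorch-style framework).
\begin{verbatim}
# One training update step: forward pass, loss, back-propagation, update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

images = torch.randn(8, 3, 32, 32)     # random stand-in mini-batch
labels = torch.randint(0, 10, (8,))    # random ground-truth labels

outputs = model(images)                # feed-forward inference pass
loss = criterion(outputs, labels)      # error vs. the ground-truth labels

optimizer.zero_grad()
loss.backward()                        # back-propagation through all layers
optimizer.step()                       # one weight update step
\end{verbatim}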
A more randomized variant of Gradient Descent, aptly called Stochastic Gradient Descent (SGD), was a successful attempt by~\cite{robbins_stochastic_1951,gardner_learning_1984} to speed up training through approximation. Gradient Descent in its purest form calculates the gradient over the entire dataset and uses a single weight update per epoch. The key insight of SGD is that the network does not need to see the entire dataset to be able to compute a loss gradient that approximates the \textit{ideal} gradient for the current weights. Seeing a random sub-sample of the entire dataset is sufficient to calculate the loss for a given state of the weights, significantly speeding up the iterative learning process by adding many more update steps. SGD by itself does have a few optimization downsides: it is susceptible to saddle-points~\cite{jin_how_2017} and can oscillate wildly in ravines~\cite{werfel_learning_2005}, especially when the wrong learning rate schedule is used. To partially combat these effects, a momentum term can be added to the gradient~\cite{qian_momentum_1999,nesterov_method_1983} that adds a moving average (typically $\gamma = 0.9$) of past gradients to the current loss derivative. SGD alone without momentum~\cite{bengio_advances_2012} is also theorized not to be able to reliably find good global minima because it can easily get trapped in less optimal local minima. Another consideration with SGD is how large to make the sample size to ensure it is a representative statistical sample. Mini-batch SGD~\cite{li_efficient_2014} uses small batches of examples (typically around 128) and averages their loss gradients into a single weight update. There has been extensive evaluation of mini-batch SGD~\cite{ruder_overview_2016,keskar_improving_2017,schaul_no_2012,bengio_learning_2009} within deep learning literature, including distributing the iterative training process to parallelize the gradient computation across multiple machines~\cite{hecht-nielsen_theory_1989,krizhevsky_one_2014,dean_large_2012}.
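The momentum update described above reduces to a few lines; the sketch below is NumPy-only, and the weight vector, gradient, and hyper-parameter values are illustrative rather than taken from any specific experiment.
\begin{verbatim}
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, gamma=0.9):
    """One mini-batch update; v is a moving average of past gradients."""
    v = gamma * v + lr * grad      # momentum term (typically gamma = 0.9)
    w = w - v                      # weight update
    return w, v

w, v = np.zeros(10), np.zeros(10)
grad = np.random.randn(10)         # stands in for a mini-batch loss gradient
w, v = sgd_momentum_step(w, v, grad)
\end{verbatim}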
\subsection{Regularization}
Turning our attention back to the original discussion on image classification and GoogLeNet, the authors used the ADAM optimizer because it works well with complex network architectures and is remarkably fast compared to mini-batch SGD with momentum. All of the neural networks presented in this dissertation are optimized using mini-batch SGD with momentum even though it is slower compared to ADAM (see~\cite{keskar_improving_2017}). Other regularization improvements used on GoogLeNet, such as batch normalization~\cite{ioffe_batch_2015} and more aggressive data augmentation~\cite{taylor_improving_2017,eggert_benefit_2015} schemes, allowed the Google team to train such a deep model successfully. The work in this dissertation also applies both concepts for all of the neural network training.
\subsubsection{Batch Normalization}
Batch normalization~\cite{ioffe_batch_2015,ioffe_batch_2017} (also known as ``batch norm'') plays a critical role in the performance of deep neural network training as it normalizes the output of each layer to have zero mean and unit variance. In addition, batch norm helps to control run-away activations, oscillations, and exploding gradients~\cite{pascanu_understanding_2012}, lowering training time. When batch normalization is applied to a layer, it learns two additional parameters: $\gamma$ and $\beta$. The $\gamma$ term is used to scale the normalized activations of a layer, and $\beta$ is added as an additional, layer-specific bias term. The normalization statistics are computed from each mini-batch and are expected to approximate the mean and variance of the entire dataset for a given layer's activations.
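The batch norm computation can be summarized with a short NumPy sketch; it covers only the training-time normalization, scale, and shift described above and omits the running-statistics bookkeeping that a real layer maintains.
\begin{verbatim}
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize per feature, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learned scale and bias

x = np.random.randn(128, 64)                  # one mini-batch of activations
y = batch_norm(x, np.ones(64), np.zeros(64))
\end{verbatim}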
\subsubsection{Weight Decay}
The two most common regularizers in neural network training are L1 (Laplacian) and L2 (Gaussian) weight decay. L1 regularization pushes certain weights to be \textit{exactly} zero and is analogous to having weight decay with a Laplacian prior on the $W$ weight matrices:
\begin{align}
\begin{split}
\Omega_{L1}(\theta) &= \sum_{k=1}^L \sum_{i=1}^{I^{(k-1)}} \sum_{j=1}^{J^{(k)}} \left | W_{i,j}^{(k)} \right |
\end{split}
\end{align}
\begin{align}
\begin{split}
\nabla_{W^{(k)}}\Omega_{L1}(\theta) &= \operatorname{sign}\left(W^{(k)}\right)
\end{split}
\end{align}
\noindent L2 regularization pushes the weights \textit{towards} zero and is analogous to weight decay with a Gaussian prior on the weight matrices:
\begin{align}
\begin{split}
\Omega_{L2}(\theta) &= \sum_{k=1}^L \sum_{i=1}^{I^{(k-1)}} \sum_{j=1}^{J^{(k)}} \left ( W_{i,j}^{(k)} \right )^2 \\
&= \sum_{k=1}^L \left \| W^{(k)} \right \|_F^2
\end{split}
\end{align}
\begin{align}
\begin{split}
\nabla_{W^{(k)}}\Omega_{L2}(\theta) &= 2\,W^{(k)}
\end{split}
\end{align}
\noindent L2 weight decay is used extensively by the research community and is used when training the neural networks presented in this dissertation. It is a very effective regularization technique when used with the ReLU~\cite{mishkin_all_2015,nair_rectified_2010,dahl_improving_2013} non-linear activation function and batch normalization.
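In practice, L2 weight decay is typically applied through the optimizer rather than as an explicit penalty term; the sketch below assumes a PyTorch-style optimizer, and the tiny layer and the $5\times10^{-4}$ coefficient are illustrative only.
\begin{verbatim}
import torch
import torch.nn as nn

model = nn.Linear(128, 10)         # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
\end{verbatim}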
\subsubsection{Data Augmentation}
Data augmentation~\cite{taylor_improving_2017,eggert_benefit_2015} is the process of applying a set of deterministic or randomized operations on an input image before it is used as an example when training a neural network. This process can be seen as a method of balancing the signal-to-noise ratio to help control over-fitting. Standard augmentation operations for image data include exposure and hue changes, random Gaussian pixel noise, translation, rotation, skewing, horizontal and vertical flipping, color space transformations, and other sources of randomized pixel noise.
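A typical augmentation pipeline of this kind can be expressed in a few lines; the sketch below assumes \texttt{torchvision} transforms, and the specific operations and magnitudes are illustrative choices rather than the exact scheme used in this dissertation.
\begin{verbatim}
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, hue=0.05),    # exposure/hue changes
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), shear=5),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])
\end{verbatim}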
\subsection{Skip-connection Networks}
Neural network architectures before GoogLeNet were relatively linear and did not use multiple branches of activations for a given layer. GoogLeNet introduced the very complex (for its time) Inception Module and showed that complex flows of convolutional activations and their error gradients could be calculated and learned. Using this insight, neural network researchers asked what would happen if, instead of branching or copying a layer into multiple streams, some layers were skipped entirely.
\subsubsection{Residual Networks (ResNets)}
The ILSVRC 2015 image classification challenge was won by He \textit{et al.} and their network ResNet (Residual Neural Network)~\cite{he_deep_2015}. The authors drastically increased the depth and circuit length of the neural network by using ``skip connections'' and liberal use of batch normalization throughout the network. As a result, the network achieved a top-5 error rate of 3.57\%, surpassing human-level performance. The introduction of residual skip connections was a breakthrough in the development of neural network model architectures. The chief design challenge at the time was that deeper networks were shown to increase performance, but increasing the depth of the network caused training problems like vanishing gradients and co-adaptation~\cite{pascanu_difficulty_2013,lee_deeply-supervised_2014,glorot_deep_2011}. The benefit of residual connections is that the network can selectively turn off a convolutional filter by learning the additive identity~\cite{he_identity_2016}. The authors showed that the identity is not only easy to learn (especially with L2 regularization), but it also results in more stable and faster training because the skipped convolutional activations become trivial to calculate.
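The skip connection can be illustrated with a minimal residual block; the sketch below is PyTorch-style, assumes matching input and output channel counts, and is not the exact block definition from the ResNet paper.
\begin{verbatim}
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: if the block learns F(x) = 0, the output is x.
        return self.relu(out + x)
\end{verbatim}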
\subsubsection{Dense Residual Networks (DenseNet)}
An extension of residual networks is the work by Huang \textit{et al.}~\cite{huang_densely_2016} and their DenseNet architecture. The DenseNet model takes the idea of combining activations for a given layer and a skip connection and extends it by combining the activations from multiple previous layers through skip connections. They further show an increase in performance compared to ResNet (at the cost of speed) and argue that the performance increase comes from increased feature reuse and deep supervision learning~\cite{lee_deeply-supervised_2014} within the network. The whole-image classifier, annotation labeler, and Census Annotation models (described in Chapters~\ref{chapter:detection} and~\ref{chapter:ca}) use a pre-trained 201-layer DenseNet model as their feature extraction backbone.
\vspace{0.5cm}
\noindent The image classification task has essentially been considered solved by researchers, and new work in deep learning since 2015 has focused more on making networks smaller~\cite{ma_shufflenet_2018,tan_mnasnet_2019,iandola_squeezenet:_2016}, significantly faster~\cite{redmon_you_2016,hubara_binarized_2016,iandola_densenet:_2014}, or wider~\cite{zagoruyko_wide_2016}, or has moved on to more complex tasks like object detection, segmentation, and 3D applications. The foundation mentioned above of robust feature extraction and research in training improvements has led directly to using neural networks for detection tasks.
\section{Object Detection \& Semantic Segmentation}
The computer vision community after 2015 pivoted its focus to more complex tasks like object detection and semantic segmentation since improvements on the classification task were diminishing. The task of object detection is defined by the merging of two separate computer vision tasks: bounding box localization and image classification. Object detection is also getting close to being a solved problem, with real-time commodity implementations available on phones~\cite{howard_mobilenets_2017} and even readily accessible tools for the wildlife conservation community~\cite{beery_efficient_2019}. However, novelty is still being demonstrated for specific use-cases and real-world applications like large-scale animal re-identification. This section provides an overview of relevant methods to the work in this dissertation on animal detection for ID; a comprehensive review of object detection, evaluation primitives, and datasets can be found in~\cite{liu_deep_2019} and~\cite{zhao_object_2019}.
\subsection{Detection Before Deep Learning}
Before neural networks and deep learning became a ubiquitous solution for object detection, many algorithms employed hand-engineered feature descriptors and classifiers to find objects. This section gives a brief overview of the most common approaches.
\subsubsection{SVM Classifier on HOG and Sliding Windows}
Histogram of Oriented Gradients (HOG)~\cite{dalal_histograms_2005} was the pre-deep learning grandparent of feature extraction and object detection~\cite{felzenszwalb_discriminatively_2008,vondrick_hog-gles:_2013,felzenszwalb_object_2010}. The method applies a fixed-size sliding window across an image and extracts a HOG feature vector for that window. A Support Vector Machine (SVM)~\cite{cortes_support-vector_1995} is then used to train a classifier and perform binary classification. The windows are applied on a pyramid of multiple resolutions to support multiple scales of object detections~\cite{malisiewicz_ensemble_2011}. While these detectors could be trained quickly and with minimal data, they also suffered from poor general performance.
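The HOG-plus-SVM pipeline can be sketched as follows, assuming \texttt{scikit-image} for the HOG descriptor and a previously trained binary classifier (e.g., a linear SVM from \texttt{scikit-learn}); the grayscale window size, stride, and decision threshold are illustrative.
\begin{verbatim}
from skimage.feature import hog

def sliding_window_detections(image, clf, window=(128, 64), stride=32):
    """Score each grayscale window with a trained binary SVM (clf)."""
    boxes = []
    win_h, win_w = window
    for y in range(0, image.shape[0] - win_h + 1, stride):
        for x in range(0, image.shape[1] - win_w + 1, stride):
            patch = image[y:y + win_h, x:x + win_w]
            feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2))
            if clf.decision_function([feat])[0] > 0:   # positive detection
                boxes.append((x, y, win_w, win_h))
    return boxes
\end{verbatim}
\noindent In practice, the same procedure is repeated over an image pyramid to support multiple object scales, as described above.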
\subsubsection{Deformable Parts Models (DPM)}
Deformable Parts Models (DPM) by Felzenszwalb \textit{et al.}~\cite{felzenszwalb_discriminatively_2008} is a more sophisticated version of HOG and sliding windows and was widely popular. The DPM algorithm utilizes a 5-point star model (with a unique model per class) that learns a HOG feature classification for the entire image (the root) and latent variables for the locations of five different parts located around the root. The star pattern is designed to ``deform'' to find parts in slightly different locations and poses in relation to the root for a given object example. After neural networks had become ubiquitous, an attempt was made to merge their feature extraction abilities with DPM. The work of Wan \textit{et al.}~\cite{wan_end--end_2014} and~\cite{ouyang_deepid-net:_2014} provides an end-to-end trained model that uses convolutional neural network feature extraction with DPM and non-maximum suppression (NMS)~\cite{hosang_what_2015,bodla_soft-nmsimproving_2017,hosang_learning_2017} for object detection. The work of Girshick \textit{et al.}~\cite{girshick_deformable_2015} shows that DPM is a restricted version of convolutional neural networks and provides the argument that CNNs are a more capable and expressive formulation of DPM. While implicitly learned parts are not a component of the detection pipeline proposed in this thesis, it does support explicit, manually-defined parts that can be detected as separate annotations and then linked to a body annotation.
\subsubsection{Hough Random Forests}
The use of Hough Forests (i.e., Hough-transform~\cite{ballard_generalizing_1981} Random Forests) for object detection was demonstrated by Gall \textit{et al.} in~\cite{gall_class-specific_2009}. Unlike DPM, the algorithm is somewhat resilient to partial and occluded objects due to its voting scheme~\cite{winn_layout_2006,bonde_robust_2014}. The authors showed that random forests have advantageous training properties and extend naturally to patch-based image textures. They argue that the leaf nodes of a random forest tree can be considered a ``discriminative codebook''~\cite{moosmann_fast_2006}, which are used to generate classification probabilities. Furthermore, by training to optimize for both classification and regression within the same random forest tree, they can learn a spatial relationship of where a classified image patch is likely located in relation to an object's center. The approach is extended by Barinova \textit{et al.}~\cite{barinova_detection_2012} to address occluding objects while others have applied random forests to face, pose, and action recognition~\cite{dantone_real-time_2012,fanelli_real_2011,yao_hough_2010}; a comprehensive analysis of Hough Forests is presented in~\cite{gall_hough_2011}. A customized version of the implementation by Gall \textit{et al.} is evaluated in Chapter~\ref{chapter:detection} as a baseline algorithm against more modern neural network detection approaches.
\subsection{Datasets for Animal Detection}
Parallel to the rise of advanced machine learning methods was the creation of large computer vision datasets with supervised labels. However, the few approaches that have used neural networks for animal detection have focused on analyzing camera trap photos~\cite{beery_recognition_2018,verma_wild_2018,schneider_past_2019} and other applications for counting animals~\cite{sarwar_detecting_2018,trnovszky_animal_2017,lopez-vazquez_video_2020,rey_detecting_2017}. Exploring animal detection for animal identification often limits the related work to only animal re-identification methodologies, which often lack a detection component or data suitable for training a detector (i.e., pre-cropped images).
The concept of a detection pipeline, while not novel when considering its components separately, has not been comprehensively analyzed or reproduced in other works for animal ID. The detection pipeline is primarily designed to be used with ground-based photographs but can be re-tooled to work with overhead aerial images for the detection of animals~\cite{vermeulen_unmanned_2013,eikelboom_improving_2019,sevo_convolutional_2016,zhu_orientation_2015,sarwar_detecting_2021}.
\subsubsection{Visual Challenges: PASCAL VOC, ILSVRC \& COCO}
While the most prominent public datasets do not focus entirely on animals, they often contain bounding boxes for a handful of different animal species or high-level categories. For example, the PASCAL VOC Object Challenge (VOC)~\cite{everingham_pascal_2010} was one of the earliest datasets that had thousands of images and bounding boxes for 20 categories, including six animal classes (bird, cat, cow, dog, horse, sheep). The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset~\cite{russakovsky_imagenet_2015} was foundational for research in deep neural networks as it offered 1.2 million images for 1,000 object categories, with a non-trivial portion representing animals. The scale and variety allowed the first generation of neural network models to train well and not severely overfit, giving time and diversity for general-purpose convolutional kernels to be learned. Unfortunately, the animal classes in ILSVRC are very general. For example, the synset \texttt{n02391049} for ``zebra'' includes multiple zebra species taken in the wild by professional photographers, zebras seen in zoos, stuffed zebra animal toys, fondant zebras on cakes, and other abstracted forms like ``zebra crosswalks''. Thus, the utility of this dataset for animal detection with real-world images is limited.
The Microsoft Common Objects in Context (COCO) dataset~\cite{lin_microsoft_2014} is a large dataset with 330,000 images for 80 categories, 10 of which are animals. Interestingly, the COCO dataset has instance segmentations for categories like ``zebra'' and ``giraffe'', which can train segmentation networks. The detection pipeline, however, is designed to be bootstrapped and evaluated without the need for segmentation ground-truth because segmentation data is very laborious to annotate. Therefore, all of the methods herein are focused on bounding boxes.
\subsubsection{Camera Traps \& Citizen Science}
Beyond the large challenge datasets, there are community-based projects like Zooniverse's Snapshot Serengeti~\cite{simpson_zooniverse:_2014,swanson_snapshot_2015} that use citizen science~\cite{cohn_citizen_2008,irwin_citizen_1995,silvertown_new_2009} to annotate camera trap data for 40 African species. The iNaturalist~\cite{van_horn_inaturalist_2018} project also uses citizen science to gather and label image data for various animal species. These projects offer lots of data but do not provide bounding boxes for animals and therefore also do not offer ground-truth animal ID data. One of the primary benefits of using citizen scientists is that a large number of volunteers can be used to survey a large area~\cite{haklay_geographical_2010,kumar_leafsnap:_2012,sullivan_ebird:_2009}. The Labeled Information Library of Alexandria: Biology and Conservation (LILA BC)\footnote{LILA BC - \url{http://lila.science} (Accessed: Oct. 29, 2021).} project run by Microsoft's AI for Earth initiative is a public repository of animal datasets for conservation. The vast majority of the datasets listed in this repository are based on camera trap imagery and are often limited in their use for detection and animal ID. New applications that use camera-trap datasets~\cite{swanson_snapshot_2015,ancrenaz_handbook_2012,forrester_emammalcitizen_2014,maputla_calibrating_2013,norouzzadeh_automatically_2018} for training show that algorithms can successfully classify camera-trap imagery with computer vision and be a foundation for count-based population estimates.
\subsubsection{Bootstrapping, Active-learning \& Instance-based Learning}
A good part of the work in this dissertation is concerned with curating animal ID datasets. However, the protocols surrounding the collection of hand-labeled ground-truth bounding boxes share similarities with bootstrapping detection algorithms~\cite{eggert_benefit_2015,chen_webly_2015,tong_salient_2015,wan_bootstrapping_2016} that perform weakly-supervised learning~\cite{li_weakly_2016,oquab_weakly_2014}. One highlighted example is the Annotation Interface for Data-driven Ecology (AIDE) project~\cite{kellenberger_aide_2020}, which allows machine learning models to be trained quickly as annotated data is generated, similar to instance-based learning algorithms~\cite{aha_instance-based_1991}. The proposed method uses whole-image species classifications to train a whole image classifier and limited human interaction to refine proposed bounding box candidates. This technique can be viewed as a relaxation of one-shot~\cite{zeiler_visualizing_2014,fei-fei_one-shot_2006,thrun_is_1996} and few-shot~\cite{xu_few-shot_2016} learning. Most strikingly, the bounding box refinement problem has been addressed by~\cite{papadopoulos_we_2016}, which shows meaningful speedups in human interactions compared to bounding box regression by hand.
\subsection{Two-Stage Detection with Region Proposals}
The earliest deep learning approaches in object detection were created to quickly capitalize on the wild success of their respective winning image classification methods~\cite{krizhevsky_imagenet_2012,sermanet_overfeat:_2013,gouk_fast_2014,szegedy_deep_2013}. For example, the early winners of the ILSVRC image classification challenge also saw winning detection solutions by densely applying their neural networks with a sliding window across the image. These methods were relatively crude as they did not fundamentally address detection as a separate task but simply as a brute-force reformulation of the image classification task. These types of two-stage detectors became popular, however, as classification accuracy rapidly improved. A two-stage detector uses an algorithm to solve the localization problem first and feeds candidate bounding boxes to a second algorithm for classification (or suppression). We will explore both saliency-based bounding box localization algorithms and deep neural networks that can be used to propose regions around objects for use in a two-stage detection process.
\subsubsection{Deep Saliency \& Attention}
In computer vision, the concept of saliency (or ``visual saliency'')~\cite{chang_fusing_2011,liu_learning_2011,wang_detect_2018} is the idea that particular objects or items in an image draw a significant amount of attention from the eye. For example, attention is generally pulled to subjects in motion, the most prominent object in the frame, or an object that ``pops out'' with an abnormal appearance~\cite{wang_familiarity_1994}. The critical insight is that salient object detection is class agnostic, and an algorithm can be trained to predict a set of \textit{classless} bounding boxes around things of interest. The salient bounding boxes are then given to a second image classification network to construct the final object detections (two-stage detection). Various pre-deep learning methods have been used for salient object detection, including the use of minimum spanning trees~\cite{tu_real-time_2016}, edges~\cite{zitnick_edge_2014}, BInarized Normalized Gradients (BING) by Cheng \textit{et al.}~\cite{cheng_bing:_2014} for speedy region proposals, and bottom-up segmentation algorithms like Selective Search by Uijlings \textit{et al.}~\cite{uijlings_selective_2013}.
Object saliency with deep learning, also known as deep saliency~\cite{he_supercnn:_2015, li_lcnn:_2015,borji_salient_2015,jiang_salient_2013,liu_ssd:_2016,li_deepsaliency:_2016}, has been shown to be a powerful tool for suggesting candidate bounding boxes for detection. The work by K\"{u}mmerer \textit{et al.}~\cite{kummerer_deep_2014,kummerer_deepgaze_2016} began the first steps of exploring deep saliency with their Deep Gaze network, which borrowed the architecture and transferred weights of AlexNet~\cite{krizhevsky_imagenet_2012} to create a saliency map of the input image. The parallel work by Liu \textit{et al.}~\cite{liu_dhsnet_2016} on the deep hierarchical saliency network (DHSNet) was also among the first to train an end-to-end neural network to produce saliency maps. AttentionNet by Yoo \textit{et al.}~\cite{yoo_attentionnet:_2015} worked in a slightly different manner in that it aggregated many different sources of salient and weak detection outputs to construct its final detection predictions. Work has also been done to combine local and global contextual information for more accurate saliency maps~\cite{zhao_saliency_2015,chu_multi-context_2017,wang_attention-based_2017,spain_modeling_2011}, take advantage of an attention mechanism more directly~\cite{yoo_attentionnet:_2015,zhang_progressive_2018,hara_attentional_2017,kosiorek_hierarchical_2017,wang_survey_2016,wang_deep_2018}, support multiple resolutions~\cite{liu_mr-cnn_2019,wang_salient_2019,zhao_pyramid_2019}, and run in real-time applications~\cite{diao_efficient_2016, fan_shifting_2019,liu_simple_2019}. The Annotation of Interest (AoI) classifier, presented as a component of the detection pipeline in Chapter~\ref{chapter:detection}, has an architecture that is structurally similar to Overfeat~\cite{sermanet_overfeat:_2013} but is trained on an objective that is more closely related to deep saliency and attention networks.
\subsubsection{R-CNN \& Region Proposal Networks (RPN)}
Region Proposal Networks (RPNs)~\cite{cho_unsupervised_2015,bazzani_self-taught_2016} are specialized neural networks that separate the classification task from object detection and focus only on the localization of bounding boxes. RPNs share a similar design goal with object saliency; both are trying to propose class-agnostic bounding box locations for objects, but with the distinction that RPNs often share weights with a neural network image classifier. Similar to salient object detectors, the proposed regions are classified using an image classification neural network to form the final detections.
One of the first neural networks to use region proposals was the Region-based Convolutional Neural Network (R-CNN)~\cite{girshick_rich_2014}. The Selective Search~\cite{uijlings_selective_2013} algorithm was used initially as input to R-CNN as a preceding region proposal algorithm, but it was prohibitively slow and could obviously not share weights. As a result, the architecture of R-CNN was updated to add a dedicated RPN neural component for object localization~\cite{girshick_fast_2015} alongside an updated classification component~\cite{he_deep_2015}. The downside of this network was that the two-branch structure (RPN and classifier) meant that it needed to be trained with an alternating procedure, optimizing either the RPN or the classifier at a given time. This design led to training instability but was still a meaningful improvement in speed and accuracy over using an external region proposal algorithm. A third iteration of the R-CNN detector (called Faster R-CNN~\cite{ren_faster_2015}) uses a combined training procedure and was the winner of several tasks in ILSVRC 2015. The work in this dissertation analyzes Faster R-CNN's two-stage detection performance using an off-the-shelf implementation.
The design of having a separate component within the network to produce bounding box proposals has been explored by other works. For example, the DeepProposal model by Ghodrati \textit{et al.}~\cite{ghodrati_deepproposal_2015} and Feature Pyramid Networks (FPN) by Lin \textit{et al.}~\cite{lin_feature_2017} use the intermediate activations between layers of an image classification network to find potential object candidates at various scales and perform Non-Maximum Suppression (NMS) to produce a final set of boxes for classification. The refinement of the R-CNN approach also continues by taking better advantage of the image classifier by training the network to work as a cascade of classifiers~\cite{cheng_revisiting_2018,cai_cascade_2017}, where earlier layers discard easy negatives and focus on parts while deeper layers can specialize in large objects.
\subsection{Single-Stage Detection} \label{sec:ssd}
Single-stage detectors (also known as single-shot detectors)~\cite{erhan_scalable_2014} take a step back and examine what the best neural network structure should be for a detector without being dependent on preconceived designs inherited from image classification networks. In contrast with two-stage detectors, single-stage detectors predict a single, combined result of bounding boxes and classifications without needing two inference steps or intermediate region proposals.
\subsubsection{You Only Look Once (YOLO)}
One of the first neural network solutions that was able to train a unified region proposal component with object classification is called, humorously, You Only Look Once (YOLO) by Redmon \textit{et al.}~\cite{redmon_you_2016}. The YOLO network is designed to predict an \textit{N}x\textit{N} grid of cells (typically 7x7) where each cell assigns itself an object classification label and produces \textit{M} bounding box predictions. Each bounding box has a 4-tuple regression prediction for the box's location and a salient ``objectness'' confidence score (similar to~\cite{kuo_deepbox:_2015}). The final predicted bounding boxes are generated by multiplying the classification label scores for each cell by the object confidence scores for each of its bounding boxes. The ability to train YOLO as a unified pipeline makes it advantageous for real-world applications due to its efficiency and lack of additional training infrastructure (no need for alternating between branches during training like R-CNN). Due to YOLO's relatively simple network architecture without an RPN, its authors reported real-time performance using GPUs.
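The score combination described above can be sketched in a few lines of NumPy; the grid size, number of boxes per cell, class count, and random tensors below are illustrative shapes rather than actual network outputs.
\begin{verbatim}
import numpy as np

S, M, C = 7, 2, 20                      # grid size, boxes per cell, classes
class_probs = np.random.rand(S, S, C)   # per-cell class label scores
box_conf = np.random.rand(S, S, M)      # per-box "objectness" confidence
boxes = np.random.rand(S, S, M, 4)      # per-box (x, y, w, h) regressions

# Final score for each (cell, box, class) = class probability * objectness.
scores = class_probs[:, :, None, :] * box_conf[:, :, :, None]  # (S, S, M, C)
\end{verbatim}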
However, YOLO's integration of bounding box predictions into a unified network comes with downsides: a complex loss function, additional hyper-parameters, an unpredictable error gradient at the start of training (which often diverges), and a lack of multi-resolution detections. To address training instability, YOLO uses transfer learning and a process called ``burn-in'' that starts with a relatively small learning rate to warm up the network before the actual training. YOLOv2~\cite{redmon_yolo9000:_2016} was introduced to address common failures made by the original network; YOLOv2 adds Batch Normalization to increase training stability and lessen the need for burn-in, adds training and inference at multiple scales, and starts using anchor boxes~\cite{yu_unitbox_2016}. An \textit{anchor box} is defined as one of \textit{k} centroids when the ground-truth bounding boxes are clustered. The use of anchor boxes allows the model to focus on regions and sizes of boxes that are likely to be seen instead of attaching them to an arbitrary underlying grid cell. Finally, YOLOv3~\cite{redmon_yolov3_2018} was introduced to modernize the approach of YOLOv2 with a better feature extraction backbone using ResNets and adds support for three separate scales of predictions to better localize smaller objects (similar to~\cite{li_scale-aware_2019}). For this research on the detection pipeline, the YOLOv2 model is analyzed against Faster R-CNN for animal detection.
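Anchor boxes can be sketched as a clustering step over the ground-truth box sizes; the example below uses a plain Euclidean $k$-means from \texttt{scikit-learn} for simplicity (whereas the YOLOv2 paper describes clustering with an IoU-based distance), and the random box sizes and $k=5$ are illustrative.
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

box_sizes = np.random.rand(1000, 2)   # (width, height) of ground-truth boxes
anchors = KMeans(n_clusters=5, n_init=10).fit(box_sizes).cluster_centers_
\end{verbatim}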
\subsubsection{Single-Shot Detectors}
Shortly after YOLO was published, the Single Shot Multibox Detector (SSD) by~\cite{liu_ssd:_2016} was introduced as an alternative single-shot detector. The main difference between SSD and YOLO is that SSD uses a fully convolutional neural network (FCNN)~\cite{long_fully_2015} while still being able to achieve real-time detection performance. SSD's accuracy was also somewhat higher than the first version of YOLO, and it rivaled two-stage detection approaches like Faster R-CNN while being substantially faster at inference. The design of SSD, like others~\cite{shen_dsod_2017,li_tiny-dsod_2018,lin_focal_2018,bell_inside-outside_2016}, takes advantage of a unified convolutional structure and introduces bounding box prediction at intermediate layers for multi-scale detections. Other approaches use Receptive Field Blocks~\cite{liu_receptive_2018} to enhance feature selection for object detection, and the Trident Network~\cite{li_scale-aware_2019} approach learns a three-branch, single-shot neural network that generates small, medium, and large bounding box predictions. More recent single-shot detectors attempt to remove the need for anchor boxes entirely and instead use keypoint triplets~\cite{duan_centernet_2019} or hourglass designs~\cite{melekhov_image-based_2017,newell_stacked_2016,yang_stacked_2017}.
\subsection{Semantic \& Instance Segmentation}
Novel bounding box proposal and single-shot networks became less frequent around 2018 and 2019 as incremental improvements to object detection performance diminished. The fundamental problem is that bounding boxes are rigid and limiting shapes -- detection failures became more nuanced~\cite{redmon_yolov3_2018} because boxes are sometimes hard to draw and locate consistently. It was clear that to advance the state-of-the-art for object detection, a reformulation of the objective was needed: the community needed better, more precise bounding boxes. It is not so much that existing bounding boxes in large datasets were labeled incorrectly, but rather that bounding boxes were too coarse a concept, and access to finer detail was needed.
Semantic segmentation is the task of labeling the exact pixels that belong to a given class category. Semantic segmentation has historically been used as a means for object detection~\cite{yang_object_2014,pinheiro_learning_2015,hariharan_simultaneous_2014,fragkiadaki_learning_2015,hu_fastmask_2017} and locating parts~\cite{chai_symbiotic_2013}, and has been implemented using a range of techniques, including Fisher vectors~\cite{cinbis_segmentation_2013}, fully connected CRFs~\cite{chen_semantic_2014}, and graphs~\cite{felzenszwalb_efficient_2004}. For example, given a picture of Times Square in New York City, we could ask a person to paint all cars with red paint, buildings with blue paint, sky or water with yellow paint, road and sidewalks with purple paint, people with green paint, and everything else with orange paint. The goal would be to paint every pixel in the image with an assigned color. If we want to segment out each unique car in the image, however, painting all of the cars with a single red color offers insufficient detail to perform the task. Instance segmentation is an enhancement of semantic segmentation where each instance of a given class is also annotated. In our New York example, an instance segmentation would ask a computer to color all cars with different shades of red so that the boundary for all cars is defined down to the pixel.
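The distinction can be illustrated with two tiny label masks; the $2\times5$ ``image'', class ids, and instance ids below are made up purely for illustration.
\begin{verbatim}
import numpy as np

# Semantic mask: one class id per pixel (0 = background, 1 = car).
semantic = np.array([[1, 1, 0, 1, 1],
                     [1, 1, 0, 1, 1]])   # both cars share class id 1

# Instance mask: objects of the same class get distinct ids.
instance = np.array([[1, 1, 0, 2, 2],
                     [1, 1, 0, 2, 2]])   # the two cars become ids 1 and 2
\end{verbatim}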
The required level of detail for segmentation is much more involved and precise than drawing a bounding box for each object, making it much slower to gather. The success of segmentation techniques has paralleled the creation of large datasets like Microsoft's Common Objects in Context (COCO) dataset~\cite{lin_microsoft_2014} that have spent the time to add instance-level segmentations for a large number of images and classes. Likewise, other methods have shown that it is possible to simulate color images and ground-truth segmentation data for training~\cite{dosovitskiy_carla:_2017,shah_airsim:_2017,qiu_unrealcv:_2016}. While this dissertation does not use semantic or instance segmentation techniques, these concepts are related to the coarse background segmentation component in the detection pipeline. The results reported here suggest that instance segmentation will allow for even more automated photographic censusing methods in the future. However, the resources and funding of conservation groups are often minimal, and it is difficult to realistically expect fully segmented ground-truth to be annotated at large scales for novel species. To maximize the real-world usefulness of the methods presented here, the focus on using annotated bounding boxes (with select metadata) is key to keeping them adaptable for new species and a feasible option for wildlife conservation groups.
\subsubsection{Fully Convolutional Neural Network (FCNN)}
A Fully Convolutional Neural Network (FCNN), introduced by Long \textit{et al.} in~\cite{long_fully_2015}, is a special type of neural network that has no fully connected dense layers. The benefit of having no dense layers is that the network is not rigidly set to a fixed input or output size. This feature can be exploited by applying the network in a fully convolutional fashion across a larger input image, and the network does not need to resort to any type of fixed-sized sliding window or shift-and-stitch techniques~\cite{sermanet_overfeat:_2013,gouk_fast_2014}. The FCNN has similarities to the All Convolutional Network by Springenberg \textit{et al.}~\cite{springenberg_striving_2014} in that the network architecture is composed entirely of convolutions with no fully connected dense or pooling layers. The design of the FCNN makes it a flexible platform for image classification, region-based object detection~\cite{dai_r-fcn_2016}, and a natural candidate for segmentation~\cite{singh_r-fcn-3000_2018}. The detection pipeline has a coarse background classifier that is implemented as an FCNN and uses semi-supervised learning~\cite{zhu_introduction_2009} on bounding boxes. There is not currently a component in the detection pipeline that relies on having full object segmentations for training data because rectangular bounding boxes are sufficient for all training.
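The flexibility of a network with no dense layers can be demonstrated with a short sketch; the two-layer stack below is a toy stand-in (PyTorch-style) and simply shows that the same weights produce a larger output map when given a larger input.
\begin{verbatim}
import torch
import torch.nn as nn

fcnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 1),     # a 1x1 convolution acts as the classifier
)

print(fcnn(torch.randn(1, 3, 64, 64)).shape)     # -> [1, 2, 64, 64]
print(fcnn(torch.randn(1, 3, 256, 320)).shape)   # -> [1, 2, 256, 320]
\end{verbatim}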
\subsubsection{U-Net \& Mask R-CNN}
The work of Ronneberger \textit{et al.}~\cite{ronneberger_u-net:_2015} proposed the novel U-Net architecture with its convolution, embedding, and up-scaling layers. U-Net uses a single-shot process to generate semantic segmentations directly from input images. The network shares outputs from the convolutional feature maps to their corresponding up-scaling segmentation maps for the same resolution. The use of up-scaling branches led to further development of de-convolutions~\cite{zeiler_deconvolutional_2010,fu_stacked_2019,cai_unified_2016} and their use in semantic segmentation. Furthermore, the work by Yu \textit{et al.}~\cite{yu_dilated_2017} on dilated residual networks allowed the network to learn how to effectively up-scale images. As for two-stage segmentation methods, Mask R-CNN~\cite{he_mask_2017} extended the authors' previous work on R-CNN to produce a semantic segmentation as outputs of the RPN. The Detectron~\cite{girshick_detectron_2018} approach uses existing bounding boxes or rough semantic segmentations to create instance segmentations. The approaches of U-Net and Mask R-CNN are very popular (with over 28,000 and 12,000 citations, respectively) and have been used for animal detection~\cite{brunger_panoptic_2020,rozsivalova_counting_2020,singh_animal_2020} and aerial counting~\cite{sarwar_detecting_2018,xu_automated_2020,barbedo_study_2019}.
\section{Animal Re-Identification \& Population Estimates}
Human re-identification (re-ID, also referred to as ``biometrics'')~\cite{zhang_alignedreid_2017,huang_labeled_2008,huang_labeled_2014,learned-miller_labeled_2016} has long been a focus of computer vision applications and has natural cross-applications with animal re-identification. While the image classification and object detection techniques we have discussed can find animals and determine their species, it is difficult to apply these concepts directly to identifying unique individuals. New algorithms are therefore needed to solve animal identification as a dedicated task. For example, a detection process is still needed to filter relevant images and sightings of animals. The job of an identification procedure is to build a searchable database of repeat sightings of the same animal and calculate a population estimate.
Historically, population estimates have been done entirely by hand, using counting-based methods~\cite{simpson_zooniverse:_2014,swanson_snapshot_2015,forrester_emammalcitizen_2014,chase_continent-wide_2016}, physical tags or collars~\cite{juang_energy-efficient_2002,mukinya_identification_1976,alexander_african_1994,mech_critique_2002,thouless_long_1995}, or manual description codes~\cite{lahiri_biometric_2011,patrick_demographic_2003,sikes_guidelines_2011}. These estimates are typically custom, one-off efforts and do not have uniform collection protocols or data analysis. Because datasets are often curated by hand, they tend to focus on a small number of individuals~\cite{schneider_similarity_2020} or on animals with few repeat sightings~\cite{polzounov_right_2016}. One of the most challenging barriers to performing population estimates with deep learning is that there is a structural mismatch in target species between large datasets for animal re-ID (that show \textit{pre-cropped} images for at least hundreds of individuals with repeat sightings of each animal over time)~\cite{li_atrw_2019,korschens_elpephants_2019,tausch_bumblebee_2020} and public datasets for animal detection (with at least thousands of annotations and original images that are seen in different locations, but without ID)~\cite{swanson_snapshot_2015,beery_iwildcam_2019,khan_animalweb_2019,nugent_inaturalist_2018}. Attempting to build deep learning algorithms for a single species can be severely limited by not having access to large-scale datasets for both the detection and identification tasks. As presented in this dissertation, the concept of photographic censusing is a bootstrappable and end-to-end framework for generating ground-truth animal detection datasets with curated animal IDs.
While this dissertation does not contribute new animal identification methodologies, it does use them in its photographic censusing process. A brief overview of animal identification is given below, but the reader is encouraged to explore the more comprehensive histories provided by Ravoor \textit{et al.}~\cite{ravoor_deep_2020}, Hoiem \textit{et al.}~\cite{hoiem_diagnosing_2012}, and Weinstein~\cite{weinstein_computer_2018}.
\subsection{Animal ID Ranking \& Verification}\label{sec:id_verification}
Animal re-identification (also known as ``animal re-ID'')~\cite{cheema_automatic_2017} can be broken up into two tasks: ranking and verification. Identification ranking~\cite{crall_hotspotter_2013,weideman_integral_2017,moskvyak_robust_2019,matthe_comparison_2017} is the process of querying the image of an animal against an existing search database of previous encounters to find visual-based matches. The most confident matches are returned in rank order, with the highest-scoring database example in position one (i.e., rank-1). Identification verification~\cite{mandal_prediction_2010,lu_surpassing_2015,sengupta_frontal_2016,ramanathan_face_2006,taigman_deepface_2014,kumar_attribute_2009} is quite different as there is no need for searching: verification asks if two presented animals are the same or not, regardless of why the pair is being compared or how it was found. For example, if you were given a grainy photo of a person's face and a pile of 100 driver licenses, you could rank the licenses according to the people you felt were the closest to matching the reference image. Maybe you would first partition them by gender, then sort by age, then organize by skin color, etc.\ and then narrow the candidates to the handful you felt were the most likely. Likewise, you could also be given the same grainy face photo and one license and asked to make a \textit{yes} or \textit{no} decision on if those two photos represent the same person. We can realistically expect an ID verification algorithm to be much faster than ID ranking; ranking images with a verifier through brute-force is possible but can quickly become infeasible as the database grows. In other words, both tools are useful for human and animal ID as they can optimize for two very different goals. If both of these tasks work relatively well for a given animal species, it is possible to build automated systems that can generate a population estimate, as this research will demonstrate.
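The two tasks can be contrasted with a short sketch over embedding vectors; the cosine-similarity scoring and the 0.8 verification threshold below are illustrative assumptions, not the scoring used by any specific ranking algorithm in this dissertation.
\begin{verbatim}
import numpy as np

def rank(query, database):
    """Ranking: sort database indices by similarity to the query."""
    sims = database @ query / (np.linalg.norm(database, axis=1)
                               * np.linalg.norm(query))
    return np.argsort(-sims)           # rank-1 match first

def verify(emb_a, emb_b, threshold=0.8):
    """Verification: a yes/no decision for a single pair of sightings."""
    sim = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return sim >= threshold
\end{verbatim}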
The challenge for identification ranking is that not all species show the same kinds of visual information for matching, even though texture-based matching is successful across a variety of species~\cite{barord_comparative_2014,chehrsimin_automatic_2018,miguel_identifying_2019,morrison_individual_2016,nipko_identifying_2020,park_where_2019,patel_shot_2020,shukla_hybrid_2019,weinstein_m_2015}. For example, a zebra has high-contrast stripe textures visible across the body that do not change over the life of the animal, a perfect example of a species that can be matched with visual ID~\cite{crall_hotspotter_2013,hiby_computer_1990,berger-wolf_wildbook_2017,lea_non-invasive_2018,oddone_mobile_2016}. On the other hand, a green sea turtle has lots of texture on its shell, but those patterns change slowly over time (like rings of a tree). The overall color and appearance of a sea turtle shell can also change based on the animal's diet. The face and flippers of a sea turtle, however, are covered with small patches (called ``scutes'') that are reliable for pattern-based ID~\cite{dunbar_hotspotter_2017,dunbar_hotspotter_2021}. It is important to recognize that not all parts (like a shell) of an animal are reliably useful for ID over time. Some species may require more specialized attention by a detector to find specific parts of the animal.
Some animals, like the bottlenose dolphin or the African elephant, do not have stripes, spots, or intricate patterns for pattern-based identification. The lack of texture, however, does not necessarily make these species unidentifiable; rather, it asks whether different paradigms of ID algorithm can make ID work for those species. Animal ID algorithms can be designed to focus on identifiable features like the outline of a dorsal fin~\cite{hughes_automated_2017}, the jagged nicks and notches of a whale fluke~\cite{blount_flukebookcontinuing_2020,blount_flukebookrecent_2019,calambokidis_update_2017,franklin_photo-identification_2020}, or the fanned-out ear of an elephant~\cite{weideman_contour-based_2019,kulits_elephantbook_2021}. Animals that do not have intricate patterns or detailed contours (i.e., local features) may still offer large structures or definition-less blob patterns (i.e., global features) that can be used for ID. For example, the bonnet callosity patterns of right whales~\cite{polzounov_right_2016,bogucki_applying_2019,kabani_improving_2017,kabani_north_2016,norman_does_2016}, the Rorschach-like underbellies of giant manta rays~\cite{moskvyak_robust_2019}, and the unique constellations of whale shark spots~\cite{araujo_getting_2020,mckinney_long-term_2017,batbouta_computer_2017} can be used to recognize and distinguish individual animals.
It also seems evident that some species, like the American red squirrel (\textit{Tamiasciurus hudsonicus}) or Grant's gazelle (\textit{Nanger granti}) in Africa, are simply beyond the practical ability of visual ID to recognize individuals. We should recognize that the abilities of any visual ID algorithm are fundamentally tied to a human's ability to confidently decide whether two sightings show the same animal. If a human were presented with two images of squirrels, it seems improbable that a reliable ``same'' or ``different'' decision could be made without the aid of scarring or a deformity. This raises the question, ``\textit{how could a ranking algorithm's results, even from a perfect oracle, be trusted if a human is unable to tell whether the rank-1 match is correct?}'' While we can consider ID ranking to be a \textit{super-human} task -- something that is expected to surpass human-level performance -- a human's ability to verify pairs should be a bellwether for ID feasibility. If humans cannot accurately verify pairs of annotations for a species, then that species is \underline{categorically incompatible} with visual ID methods and is a better candidate for a more invasive or abundance-based ID alternative. The methods described here consider unidentifiable animal species outside of the problem scope for visual population monitoring and photographic censusing.
Beyond body texture and edge contours, other approaches have treated animal ID similarly to human face ID~\cite{winckler_comparison_2005}. Animal faces have been shown to be trackable in video frames~\cite{burghardt_tracking_2004,burghardt_analysing_2006}, and modified face ID algorithms have achieved moderate success on chimpanzees~\cite{deb_face_2018,freytag_chimpanzee_2016,schofield_chimpanzee_2019}. The biggest issue with chimpanzee face ID is that the studied populations are fairly small, and how well the approach transfers to other species is not well understood. Apes are not the only candidates for face ID: face-based methods have also been applied to lemurs~\cite{crouse_lemurfaceid_2017}, and the whisker patterns of brown bears~\cite{clapham_automated_2020}, polar bears with Haar features~\cite{lienhart_extended_2002,mita_joint_2005}, and lions~\cite{kerr_facebook_2015} have shown success for identification.
The various methods for animal ID are not a direct focus of this dissertation, but some baseline algorithms are needed to demonstrate the impact and success of the contributed methods. Some algorithms, like triplet-loss networks~\cite{dong_triplet_2018,hermans_defense_2017,schroff_facenet_2015}, require significant amounts of training data and need to be bootstrapped by algorithms that do not rely on deep learning. The following algorithms were co-developed with the detection pipeline and photographic censusing methodology presented in this work and are selected as representatives for detailed analysis:
\begin{enumerate}
\item HotSpotter~\cite{crall_hotspotter_2013} - a texture-based ranking algorithm that uses local features on areas with sharp changes in contrast. This algorithm uses SIFT features~\cite{lowe_distinctive_2004} at its foundation and does not need to be trained with a deep-learning algorithm, meaning it can be run on new species out-of-the-box with minimal tuning.
\item CurvRank~\cite{weideman_extracting_2020} - a curvature-based ranking algorithm that matches local segments of an edge contour. This algorithm requires training data to predict outline contours but does not rely on comprehensive ID data for training. While this algorithm cannot be run on new species completely natively, it can cross-apply its pre-trained models on similar features (i.e., dorsal fins look very similar, regardless of species).
\item Verification Algorithm for Match Probabilities (VAMP)~\cite{crall_identifying_2017} - a random forest verification algorithm that uses hand-engineered features for comparing two sightings. This algorithm does require training data for ID comparisons but can be trained from a small (and converged) database of animals due to its data mining procedure.
\item Pose-Invariant Embeddings (PIE)~\cite{moskvyak_robust_2019} - a triplet-loss algorithm that creates a global embedding feature for distance-based ranking \textit{and} verification. This algorithm is often the most accurate for a given species but requires extensive data to train. New species cannot be ranked (or verified) by PIE until an algorithm like HotSpotter or CurvRank builds a preliminary dataset of IDs that can be used to train the feature extraction and embedding.
\end{enumerate}
\noindent The proposed components and methods in this dissertation are designed to be modular and general-purpose and may be used with other ranking or verification algorithms. An overview of these four algorithms (and their related work) is offered below.
\subsubsection{HotSpotter \& VAMP}
The work by Crall~\cite{crall_identifying_2017} performs texture-based animal ID ranking by comparing SIFT descriptors~\cite{lowe_distinctive_2004} that are extracted at keypoint locations~\cite{perdoch_efficient_2009} for an annotation. Foreground-background segmentations from the detection pipeline (see Section~\ref{sec:background}) are used to weight these extracted keypoints, and the resulting descriptors are gathered into an approximate nearest-neighbor (ANN) search data structure~\cite{muja_fast_2009}. A new annotation can then be queried against the ANN index to find descriptors similar to others in the database. Matches in the sparser regions of descriptor space (i.e., those that are most distinctive) are assigned higher scores using a ``Local Naive Bayes Nearest Neighbor'' method~\cite{mccann_local_2012}. The scores from the query that match the same individual are accumulated to produce a single score for each animal. A post-processing step then spatially verifies the matches and re-ranks the returned list of individuals~\cite{philbin_object_2007} (as will be defined and discussed in Chapter~\ref{chapter:ca}).
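A stripped-down sketch of this style of texture ranking is shown below; it assumes OpenCV's off-the-shelf SIFT implementation and a generic nearest-neighbor index from scikit-learn, and it omits HotSpotter's foreground weighting and spatial re-ranking, so it should be read as an approximation of the approach rather than the actual implementation:
\begin{verbatim}
import cv2
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Simplified LNBNN-style ranking over SIFT descriptors (no foreground
# weighting and no spatial verification / re-ranking step).

sift = cv2.SIFT_create()

def describe(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(gray, None)
    return descriptors  # (num_keypoints, 128) array

def build_index(database):  # database: list of (name, image_path) pairs
    names, descs = [], []
    for name, path in database:
        d = describe(path)
        if d is None:
            continue
        descs.append(d)
        names.extend([name] * len(d))
    index = NearestNeighbors(n_neighbors=6).fit(np.vstack(descs))
    return index, np.array(names)

def query(image_path, index, names, k=5):
    scores = {}
    dists, idxs = index.kneighbors(describe(image_path), n_neighbors=k + 1)
    for row_d, row_i in zip(dists, idxs):
        seen = set()
        for d, i in zip(row_d[:k], row_i[:k]):
            name = names[i]
            if name in seen:
                continue            # keep only the best match per name
            seen.add(name)
            # LNBNN-style score: margin to the (k+1)-th nearest neighbor
            scores[name] = scores.get(name, 0.0) + (row_d[k] - d)
    return sorted(scores.items(), key=lambda p: p[1], reverse=True)
\end{verbatim}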
In addition to the HotSpotter ranking algorithm, the Verification Algorithm for Match Probabilities (VAMP) was also developed by Crall~\cite{crall_identifying_2017}. VAMP is trained as a random forest classifier~\cite{breiman_random_2001,pal_random_2005} on a hand-engineered feature vector and produces a decision of ``same animal'', ``different animals'', or ``cannot tell'' for a pair of annotations. The model is fast to evaluate and relatively accurate for well-formed annotations.
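The following sketch illustrates the general shape of such a pairwise verifier using scikit-learn's random forest; the pairwise features, the synthetic training data, and the decision thresholds are placeholders invented for illustration, not VAMP's actual feature vector or cutoffs:
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_features(match_scores):
    # Illustrative hand-engineered features: summary statistics of the
    # matching scores computed between a pair of annotations.
    s = np.asarray(match_scores, dtype=float)
    return [s.sum(), s.mean(), s.max(), s.std(), len(s)]

# X: one feature row per reviewed pair; y: 1 = "same animal", 0 = "different".
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = rng.integers(0, 2, size=500)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def verifier_decision(features, same_thresh=0.8, diff_thresh=0.2):
    p_same = forest.predict_proba([features])[0][1]
    if p_same >= same_thresh:
        return "same animal"
    if p_same <= diff_thresh:
        return "different animals"
    return "cannot tell"   # deferred to a human reviewer

print(verifier_decision(pair_features(rng.random(12))))
\end{verbatim}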
\subsubsection{CurvRank}
The CurvRank algorithm by Weideman~\cite{weideman_integral_2017,weideman_contour-based_2019,weideman_extracting_2020} uses a U-Net~\cite{ronneberger_u-net:_2015} architecture to extract a coarse contour and a self-supervised~\cite{kolesnikov_revisiting_2019} CNN to refine that edge into a fine contour. The contour is then converted into a series of descriptors with a novel, digital curvature-based feature extractor. The descriptors are placed into a nearest-neighbor search structure so that matches can be queried. A similar algorithm called FinFindR~\cite{blount_flukebookcontinuing_2020,thompson_finfindr_2019} also works on extracted contours and uses A*~\cite{hart_formal_1968} to produce a trailing-edge segment for dorsal fins. FinFindR then uses a pre-trained classifier to recognize a fixed set of individuals, which requires substantial training data and periodic retraining.
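To make the idea of curvature-based descriptors concrete, the toy sketch below reduces an ordered contour to local turning-angle values and matches fixed-length windows of them with a nearest-neighbor index; the real CurvRank pipeline uses learned contour extraction and an integral-curvature feature, so this should be read only as a rough, assumed analogue:
\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

def turning_angles(contour):
    """contour: (N, 2) array of ordered boundary points."""
    v = np.diff(contour, axis=0)
    angles = np.arctan2(v[:, 1], v[:, 0])
    return np.diff(np.unwrap(angles))          # crude curvature proxy

def curvature_descriptors(contour, window=16, step=4):
    kappa = turning_angles(contour)
    return np.array([kappa[i:i + window]
                     for i in range(0, len(kappa) - window, step)])

# Build a search index over descriptors from a known contour, then query a
# slightly perturbed copy of it (both contours are synthetic toy data).
db_contour = np.cumsum(np.random.randn(400, 2), axis=0)
query_contour = db_contour + 0.01 * np.random.randn(400, 2)

index = NearestNeighbors(n_neighbors=1).fit(curvature_descriptors(db_contour))
dists, _ = index.kneighbors(curvature_descriptors(query_contour))
print("mean descriptor distance:", dists.mean())
\end{verbatim}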
\subsubsection{Pose-Invariant Embeddings (PIE) \& Triplet-Loss Networks}
One of the most recent techniques for animal ID ranking is the triplet-loss network~\cite{dong_triplet_2018,hermans_defense_2017,schroff_facenet_2015}. A triplet-loss network aims to learn a representation of an animal's identity directly by extracting a feature embedding that can be compared with other embeddings (without the need for normalization). This design has seen success in animal classification by normalizing the pose of birds~\cite{branson_bird_2014} and has been cross-applied to instance recognition (i.e., re-identification) for animals~\cite{schneider_similarity_2020,nepovinnykh_siamese_2020,dlamini_automated_2020}. In contrast to HotSpotter or CurvRank, the intermediate features and descriptors cannot be visualized, but the distance between two features does not need to be normalized before clustering. The ability of triplet-loss networks to learn a global feature embedding makes them generally more accurate and faster than methods that use hand-engineered features, but this comes at the cost of needing large amounts of training data.
Triplet-loss networks are an enhancement of Siamese networks~\cite{melekhov_siamese_2016,varior_gated_2016,varior_siamese_2016} and are trained by mining triplets consisting of a reference (anchor) image, a positive example, and a negative example. During training, the network learns a feature extraction for embeddings; ideally, the distance between the reference and positive embeddings should be small, while the distance between the reference and negative embeddings should be large. The Pose-Invariant Embeddings (PIE) algorithm~\cite{moskvyak_robust_2019} has an additional component that allows multiple poses of the same animal (left and right) to be learned within the same model. This dissertation uses HotSpotter and PIE to rank annotations of Gr\'evy's zebra and build a curated database. The VAMP algorithm is also compared against PIE as a verifier in an analysis of how much work can be automated during a population census.
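A minimal PyTorch sketch of a triplet-loss training step is given below; the tiny convolutional backbone, margin, and synthetic tensors are illustrative assumptions and do not reflect the PIE architecture or its training regime:
\begin{verbatim}
import torch
import torch.nn as nn

# The network maps an image crop to an L2-normalized embedding; the loss
# pushes the anchor-positive distance below the anchor-negative distance
# by a margin.

class EmbeddingNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )

    def forward(self, x):
        return nn.functional.normalize(self.backbone(x), dim=1)

net = EmbeddingNet()
loss_fn = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

# One (synthetic) training step: anchor, positive (same animal), negative.
anchor, positive, negative = (torch.randn(8, 3, 64, 64) for _ in range(3))
loss = loss_fn(net(anchor), net(positive), net(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At query time, identities are ranked by embedding distance to the database.
\end{verbatim}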
\subsection{Animal Population Estimates}
The field of animal population estimation is much older than the era of deep learning, stretching back to 1896 and the work of Johannes Petersen and his mark-recapture ecological studies~\cite{petersen_yearly_1896} on European plaice (\textit{Pleuronectes platessa}). Since that time, various statistical techniques have been used for sampling animal populations and estimating error. The detection pipeline and other methods are designed to be used as black-box components within a larger censusing framework. Various frameworks~\cite{parham_photographic_2015,forrester_emammalcitizen_2014,berger-wolf_ibeis:_2015} in the conservation literature have included computer vision components as well.
\subsubsection{Capture-Mark-Recapture} \label{sec:capture-mark-recapture}
Mark-recapture is used to estimate the size of an animal population~\cite{robson_sample_1964,pradel_utilization_1996,chapman_fallow_1975,karanth_estimating_2012,white_program_1999}. Typically, a portion of the population is captured at one point in time, and the individuals are marked as a group. Later, a second population capture is performed, and the number of previously marked individuals is counted and recorded. Since the proportion of marked individuals in the second sample should reflect the proportion of marked individuals in the entire population (assuming consistent sampling processes and controlled collection biases), the size of the entire population can be estimated~\cite{berger-wolf_wildbook_2017}. The estimate is obtained by multiplying the sizes of the first and second captures and dividing by the number of resighted (marked) individuals. Thus, the formula for the simple Lincoln-Petersen estimator~\cite{pacala_population_1985} is:
\begin{align}
N_{\textrm{est}} = \frac{Kn}{k}
\end{align}
\noindent where $N_{\textrm{est}}$ is the population size estimate, $n$ is the number of individuals in the first capture, $K$ is the number of individuals from the second capture, and $k$ is the number of \textit{recaptured} individuals that were marked from the first capture. There also exist more sophisticated extensions to the formula that account for various known sources of error~\cite{seber_estimation_1982,chapman_fallow_1975,buckland_quantifying_1991}.
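As a brief illustration with hypothetical numbers: if $n = 200$ animals are sighted and marked on the first day, $K = 180$ animals are sighted on the second day, and $k = 90$ of the second-day animals were already marked on the first day, then $N_{\textrm{est}} = (180 \times 200) \, / \, 90 = 400$ animals.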
Applying the Lincoln-Petersen estimator requires that several assumptions be met. The estimator expects that no meaningful births, deaths, immigrations, or emigrations take place between the two captures. Further, the sightability of individuals must be equal across photographs. Sampling on back-to-back days reduces the likelihood of violating the closure assumptions (births, deaths, and migration) for most large mammal species. For photographic censusing, multiple teams of volunteers can be assigned to traverse the same survey area to increase the overall number of sightings. More sightings on the first day mean better population coverage, and more resightings on the second day give a more confident population size estimate. By intensively sampling a survey area with many photographers (whose routes may haphazardly overlap), the expected sightability is high and approximately equal for any given individual in the population. Therefore, all of the required assumptions for the Lincoln-Petersen estimator can be satisfied for a photographic census. A two-day collection is structured into a public ``rally'' that focuses specifically on upholding these sampling assumptions and coordinating the help of volunteers.
This work explores a passive variant of mark-recapture, based entirely on photographs, called sight-resight~\cite{bolger_computer-assisted_2012,hiby_analysis_2013}. The entire photographic censusing technique can be viewed as an automated and large-scale implementation of a sight-resight study. By tracking individuals over time, in the spirit of Jolly-Seber methods~\cite{jolly_explicit_1965,seber_note_1965}, the proposed method can make more confident claims about the population. The more individuals that are sighted \textit{and} resighted, the more confident the population estimate and the more robust the ecological analyses will be.
\subsubsection{Graph ID \& Local Clusters and Their Alternatives (LCA)}
We now must consider how to accurately associate and curate annotations into their respective IDs. The immediate question is, ``\textit{how do we use animal ID ranking and verification algorithms as tools to build a database of animal IDs?}'' One na\"ive solution is to begin with an empty database and build it incrementally by adding one annotation at a time. Each time a new annotation is added, it is matched against the current database. The ranked ID results for the new query annotation can be passed to a verification algorithm to 1) automatically decide which database annotations (paired with the original query annotation) show the same animal or 2) filter and reorder the ranked results for human review. At any point, a human reviewer can also be presented with the same pair of query and database annotations as the verification algorithm to obtain a ground-truth decision. This design allows for human-in-the-loop~\cite{zanzotto_viewpoint_2019,xin_accelerating_2018,xin_helix_2018,kleinberg_human_2018} verification of the database as it grows, and human reviewers can help correct any errors made by the underlying machine learning algorithms~\cite{jiang_identifying_2020,gu_understanding_2019,schelter_taming_2020}. If a confident match is found, the annotation is added to an existing ID in the database; otherwise, a new ID is created. This process is termed \textit{one-vs-many agglomerate matching} and is one of the easiest to implement for large animal databases~\cite{berger-wolf_wildbook_2017}.
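The following Python sketch outlines this one-vs-many agglomerate loop with a human-in-the-loop fallback; the ranker, verifier, reviewer, and thresholds are placeholder callables and assumed values, not the specific components of the system described later:
\begin{verbatim}
# Sketch of one-vs-many agglomerate matching with a human fallback.
# rank(annot, database) -> [(id, score), ...] ordered best-first
# verify(a, b)          -> probability that a and b show the same animal
# ask_human(a, b)       -> True if a reviewer decides "same animal"

def curate(new_annotations, rank, verify, ask_human,
           same_thresh=0.9, diff_thresh=0.1, top_k=5):
    database = {}                       # id -> list of annotations
    next_id = 0
    for annot in new_annotations:
        matched_id = None
        for candidate_id, _ in rank(annot, database)[:top_k]:
            p_same = verify(annot, database[candidate_id][0])
            if p_same >= same_thresh:
                matched_id = candidate_id          # auto-accept the match
                break
            if p_same > diff_thresh:               # ambiguous: defer to human
                if ask_human(annot, database[candidate_id][0]):
                    matched_id = candidate_id
                    break
        if matched_id is None:
            matched_id = next_id                   # no match: new individual
            database[matched_id] = []
            next_id += 1
        database[matched_id].append(annot)
    return database
\end{verbatim}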
This process, however, does not have any built-in way to identify and correct ground-truth errors in the database. Database errors can be introduced and may accumulate over time if the ranking algorithm fails to retrieve a correct match from the database where one exists (false negative). An error may also be introduced when a verifier automatically decides that annotations for two different animals are the same individual (false positive). A human reviewer can also make mistakes and, for example, could decide that two annotations of the same individual are different animals (erroneously increasing the total population size by 1). As the database grows, ID mistakes can become non-trivial in size and sometimes require substantial amounts of effort to fix. One example of such a database mistake is a ``snowball''. This type of error can be expected for herding species where annotations overlap and is created when two actual individuals are incorrectly matched together under the same ID label. The error, in turn, makes it more likely for a third individual to be matched as the same name in the database, and so on until many individuals are represented by one name label (decreasing the population size). Fixing this type of error is laborious because it requires the one big name to be split into an unknown number of smaller names for each distinct individual. When we constrain ID matching to only an agglomerate process -- always making new animal IDs or adding to existing animal IDs -- it becomes exceedingly difficult to know if (or indeed how many) errors there are in the underlying database over time.
The end goal of photographic censusing is to create a consistent database of individuals and their respective sightings. This database can be used to estimate the number of animals in the overall population, an estimate that can be sensitive to systematic errors in the ground-truth ID database. Leaving these errors unaccounted for and unresolved may skew the direction or urgency of conservation action, so they must be addressed. What is needed is an overarching management algorithm that can continually curate an existing database and use \textit{many-vs-many reinforcement matching} to run consistency checks on its current IDs. This database consistency problem is important enough for accurate population monitoring that it demands a dedicated solution, and this dissertation analyzes two such algorithms: Graph ID~\cite{crall_identifying_2017} and LCA. These algorithms are responsible for ensuring that the current state of the database is trustworthy by enforcing a level of self-consistency. As database errors are found and fixed, the management algorithm should also decide which pairwise verification decisions to send to a human and control how much automation there is during the curation of the database. This type of review is similar to active learning~\cite{cohn_improving_1994,cohn_active_1996,settles_active_2009,oh_study_2021} since the updated ground-truth IDs can be used to iteratively re-train the underlying machine learning algorithms~\cite{lindskog_time_2021} and improve the overall estimate. The process of continual curation also shares similarities with database visualization for consistency checking~\cite{qin_making_2020} and ground-truth data debugging~\cite{rezig_dagger_2020,rezig_data_2019}.
The first algorithm, Graph ID~\cite{crall_identifying_2017}, allows the state of a population of animals to be represented as a graph of annotations (nodes) and pairwise decisions (edges). The nodes of the graph are all expected to be annotations that can be visually matched using a ranking algorithm. The edges between two nodes represent decisions with three possible states: ``same animal'', ``different animals'', or ``cannot tell''. The goal of the Graph ID algorithm is to construct a consistent graph of positively connected components (PCCs) where there are only negative edges between PCCs. The algorithm relies on a positive-redundancy measure within all PCCs and negative-redundancy between all matching PCCs to ensure that the database is in a consistent state. This need for explicit redundancy and the possibility of an incomparable (``cannot tell'') decision means that the algorithm stops all automated processing when an inconsistency is found, expecting a human reviewer to find and fix the issue. If the verification algorithm is not confident enough to decide a given pair, that pair is also given to a human for review. Likewise, if a PCC is inconsistent, all of its previously reviewed annotation pairs are given to humans for review until the error is found and resolved. Furthermore, since the algorithm requires all (matched) PCCs to satisfy negative redundancy, there is a quadratic increase in the number of negative edges that need to be reviewed by humans. While redundancy is conceptually easy to understand, the Graph ID algorithm places an outsized focus on enforcing it and does not take full advantage of the automated verification algorithm.
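The bookkeeping behind this idea can be sketched as follows, using an annotation graph whose edges carry the three-way decisions; the redundancy and consistency checks shown are simplified stand-ins for the actual Graph ID criteria (the sketch assumes the \texttt{networkx} library):
\begin{verbatim}
import networkx as nx
from itertools import combinations

# Annotations are nodes; reviewed pairs are edges whose "decision" attribute
# is "same" or "different".  PCCs are components of the positive subgraph
# (singleton annotations are omitted here for brevity).

def positive_components(graph):
    positive = graph.edge_subgraph(
        (u, v) for u, v, d in graph.edges(data=True) if d["decision"] == "same")
    return [set(c) for c in nx.connected_components(positive)]

def inconsistent_pccs(graph):
    """A PCC is inconsistent if any 'different' edge lies inside it."""
    bad = []
    for pcc in positive_components(graph):
        internal = graph.subgraph(pcc).edges(data=True)
        if any(d["decision"] == "different" for _, _, d in internal):
            bad.append(pcc)
    return bad

def lacks_negative_redundancy(graph, required=2):
    """Pairs of PCCs that need more 'different' edges reviewed between them."""
    pccs = positive_components(graph)
    needs_review = []
    for a, b in combinations(pccs, 2):
        negatives = sum(1 for u, v, d in graph.edges(data=True)
                        if d["decision"] == "different"
                        and ((u in a and v in b) or (u in b and v in a)))
        if 0 < negatives < required:
            needs_review.append((a, b))
    return needs_review
\end{verbatim}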
The Local Clusters and their Alternatives (LCA)\footnote{\url{https://github.com/WildMeOrg/wbia-plugin-lca} (Accessed: Oct. 29, 2021).} algorithm was developed as an alternative to the Graph ID algorithm and makes better use of the automated verifier. The (experimental and yet-to-be-published) algorithm accomplishes this goal by shifting away from the concept of positive and negative connectivity. Instead, it attempts to measure a cluster's relative stability in comparison to alternative clusterings. In addition, LCA delays human decision-making for as long as possible. Further, it does not require consistency at all times (or force human decisions the moment a mistake is found); it instead relies as much as possible on automated decision-making to infer the most likely resolution. LCA runs a series of trials by splitting a cluster apart and measuring the coherence of a handful of alternative clusterings, and it only asks for a human decision when all of the alternatives are too unstable. In practice, this drastically reduces the amount of human effort needed to curate a population graph and makes LCA a much more efficient algorithm for automated population censusing. While LCA is not a contribution of this dissertation, the work discusses how LCA behaves differently than the Graph ID algorithm and analyzes its failure modes. A large-scale experimental analysis of the LCA algorithm to verify ID datasets is a contribution of this work, as it presents an initial benchmark for the algorithm's performance compared to Graph ID.
\subsubsection{The Great Zebra \& Giraffe Count (GZGC) of 2015}
The formalized concept of a photographic censusing rally is a significant contribution of this work. A censusing rally is designed as a two-day event that focuses on collecting many images of a target species and attempts to survey its known geographic area. Citizen scientists~\cite{cohn_citizen_2008,irwin_citizen_1995,silvertown_new_2009} are used as volunteer photographers to increase the coverage of the surveyed area, distribute the workload, and make data collection more feasible overall. The image data collected by all participants are then analyzed by machine learning to produce a database of resident animals and estimate the size of the population.
One of the first real-world demonstrations of photographic censusing was The Great Zebra \& Giraffe Count (GZGC) of 2015, which is the focus of the author's master's thesis~\cite{parham_photographic_2015}. The GZGC censusing rally was a small case study performed within the Nairobi National Park in Nairobi, Kenya, to estimate the local population sizes of plains zebra (\textit{Equus quagga}) and Masai giraffe (\textit{Giraffa tippelskirchi}). The primary goal of the GZGC was to prove the effectiveness of the general censusing procedure with quickly trained volunteers and to test the workflow of using automated detection and ID algorithms for real-time feedback to participants. The insights and lessons learned from that event were applied during the Great Gr\'evy's Rally (GGR)~\cite{berger-wolf_great_2016,rubenstein_state_2018} to estimate the total number of Gr\'evy's zebra in Kenya. The details and analysis of the GGR photographic censusing rallies in 2016 (GGR-16) and 2018 (GGR-18) are the focus of Chapter~\ref{chapter:censusing}. To provide a quick summary: the two Great Gr\'evy's Rally events together collected over 90,000 images from more than 350 participants, compared to around 9,000 images and 50 contributors during the GZGC.
\section{Summary}
The techniques proposed in this dissertation span the disciplines of computer science, computer vision, and ecology and are heavily motivated by the real-world application of population monitoring. The related machine learning work in image classification, object detection, and other semantic computer vision tasks allows the automated processing of large volumes of images for photographic censusing. Separating the responsibilities into two stages -- a detection pipeline followed by a separate identification process -- is helpful because it allows for modularized development and dedicated attention when creating machine learning datasets.