-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSSD_
26 lines (24 loc) · 2.79 KB
/
SSD_
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
THE SSD ARCHITECTURE
SSD has two components: SSD head and a backbone model. Backbone model basically is a trained image classification network as a
feature extractor. Like ResNet this is typically a network trained on ImageNet from which the final fully connected classification
layer has been removed. The SSD head is just one or more convolutional layers added to this backbone and the outputs are interpreted
as the bounding boxes and classes of objects in the spatial location of the final layers activations.We are hence left with a deep neural
network which is able to extract semantic meaning from the input image while preserving the spatial structure of the image albeit at a
lower resolution. For an input image ,the backbone results in a 256 7x7 feature maps in ResNet34 . Grid cellInstead of using sliding window,
SSD divides the image using a grid and have each grid cell be responsible for detecting objects in that region of the image.
Detection objects basically means predicting the class and location of an object within that region. Background class is considered
if no object is present and the location is ignored. For instance, we could use a 4x4 grid in the example below diagram.
Each grid cell is able to output shape of the object it and the position.Anchor box and receptive field come into play are used when
there are multiple objects in one grid cell or we need to detect multiple objects of different shapes. Anchor box:Multiple anchor/prior
boxes can be assigned to each grid cell in SSD. These assigned anchor boxes are pre-defined and each one is responsible for a size and
shape within a grid cell.Matching phase is used by SSD while training, so that there's an appropriate match to anchor box with the
bounding boxes of each ground truth object within an image. For predicting that object’s class and its location the anchor box with the
highest degree of overlap with an object is responsible.Once the network has been trained,this property is used for training the network
and for predicting the detected objects and their locations. Practically, each anchor box is specified with an aspect ratio and a zoom
level. Well,we know that all objects are not square in shape. Some are shorter ,some are longer and some are wider, by varying degrees.
The SSD architecture allows pre-defined aspect ratios of the anchor boxes to account for this.The different aspect ratios can be specified
using ratios parameter of the anchor boxes associated with each grid cell at each zoom/scale level.
ZOOM LEVEL
It is not mandatory for the anchor boxes to have the same size as that of the grid cell.The user might be intrested in finding
both smaller or larger objects within a grid cell. In order to specify how much the anchor boxes need to be scaled up or down with
respect to each grid cell ,the zooms parameter is used