Implement SpatialPyramidPoolingLayer with the Split, Pooling, Flatten & Concat layers #560
Conversation
I appreciate how quickly this contribution has appeared, but this should almost certainly be done by composition and not copy-paste. For example, consider the within-channel LRN https://github.com/BVLC/caffe/blob/master/src/caffe/layers/lrn_layer.cpp#L31
Tests passed but the gradient checks were slow.
@bhack, would you like to review whether the implementation is consistent with the algorithm described in Section 2.3 of the SPP-net paper?
@kloudkl I hope that I can do it this evening or tomorrow.
(Sorry, I posted this before seeing the recent changes; please ignore my previous post - deleted)
@kloudkl I've not compiled the code to try it in depth, but it seems that you have simply handled the concat between the pooling layers and accumulating the loss.
I don't think multi-size training is blocked by the transformation layers. In the paper, the authors simulated multi-size training with multiple fixed-size networks. As the output vectors of the conv5 layers are pooled into fixed-length features by the SpatialPyramidPooling layer, the networks of different sizes can share the same fully-connected layers as their last layers. I prefer to follow the path of @moskewcz's #308 DenseNet feature pyramid computation, but their code seems too heavyweight to integrate with the SPP. More likely, I will implement the Caffe version of Torch7's PyramidPacker and PyramidUnpacker to extract features for multiple scales of an image as discussed in #189.
@kloudkl Right
const float spatial_bin_size =
    static_cast<float>(image_side_length) / spatial_bin;
pooling_param->set_kernel_size(ceil(spatial_bin_size));
pooling_param->set_stride(floor(spatial_bin_size));
While this is written in Kaiming's paper, I guess there will be some problems with this pooling approach. For example, if image_side_length == 17 and spatial_bin == 6, then you have kernel_size == 3 and stride == 2, so you actually get 8x8 bins instead of 6x6 bins. @kloudkl Could you tell me whether I am right?
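To double-check that arithmetic, here is a small sketch that applies Caffe's usual pooled-size formula, pooled = ceil((side - kernel) / stride) + 1; the variable names are illustrative, not taken from the PR:

```cpp
#include <cmath>
#include <cstdio>

// Sketch: reproduce the bin count produced by the ceil/floor kernel and
// stride selection above, using the standard pooled-size formula
// pooled = ceil((side - kernel) / stride) + 1.
int main() {
  const int image_side_length = 17;
  const int spatial_bin = 6;
  const float spatial_bin_size =
      static_cast<float>(image_side_length) / spatial_bin;             // 2.833...
  const int kernel = static_cast<int>(std::ceil(spatial_bin_size));    // 3
  const int stride = static_cast<int>(std::floor(spatial_bin_size));   // 2
  const int pooled = static_cast<int>(
      std::ceil(static_cast<float>(image_side_length - kernel) / stride)) + 1;
  std::printf("kernel=%d stride=%d pooled=%dx%d (wanted %dx%d)\n",
              kernel, stride, pooled, pooled, spatial_bin, spatial_bin);
  // Prints pooled=8x8 rather than the desired 6x6.
  return 0;
}
```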
Hi @kloudkl, I emailed Dr. Kaiming He for details, and he told me that this is how they perform spatial pyramid pooling:
Denote the width and height of the conv5 feature maps (which can be the full image or a window) as w and h. For a pyramid level with n x n bins, the (i, j)-th bin is in the range of [floor((i-1)*w/n), ceil(i*w/n)] x [floor((j-1)*h/n), ceil(j*h/n)].
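A minimal sketch of that bin-boundary rule, treating each range as left-inclusive and right-exclusive as clarified later in the thread; the struct and function names are illustrative, not code from this PR:

```cpp
#include <cmath>

// Sketch of the quoted rule: for an n x n pyramid level over a w x h conv5
// region, bin (i, j) (1-based) spans
// [floor((i-1)*w/n), ceil(i*w/n)) x [floor((j-1)*h/n), ceil(j*h/n)).
struct BinRange {
  int w_start, w_end;  // half-open range along the width axis
  int h_start, h_end;  // half-open range along the height axis
};

BinRange SppBinRange(int w, int h, int n, int i, int j) {
  BinRange r;
  r.w_start = static_cast<int>(std::floor((i - 1) * static_cast<float>(w) / n));
  r.w_end   = static_cast<int>(std::ceil(i * static_cast<float>(w) / n));
  r.h_start = static_cast<int>(std::floor((j - 1) * static_cast<float>(h) / n));
  r.h_end   = static_cast<int>(std::ceil(j * static_cast<float>(h) / n));
  return r;
}
```

Because the last bin ends at ceil(n*w/n) = w, this rule always produces exactly n x n bins that cover the whole region, unlike the single ceil/floor kernel and stride above.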
I copied this PR, and currently I am trying to implement a PyramidLevelLayer to implement this pooling behavior, based on the rectangular pooling in #614.
Yes, you are right. I realized the problem when I wrote the test cases.
Thank you for contacting the authors for clarification!
I'm solving it now.
And I think the range above includes the left border but excludes the right border, i.e. [0, 3] contains 0, 1, 2 but not 3.
To be more faithful to the implementation of the authors of the SPP-net paper, the pooling layer is extended to support floating point kernel and stride heights and widths. All 36 test cases of the pooling layer pass. The spatial pyramid pooling layer is also tested on both the CPU and the GPU.
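For context, one way such fractional kernels and strides can be turned into concrete pooling windows is sketched below; this is only one plausible interpretation, not necessarily the exact arithmetic used in this PR:

```cpp
#include <algorithm>
#include <cmath>

// Sketch: with a fractional kernel and stride, pooling window (ph, pw) over a
// height x width map can be taken as the half-open region below, mirroring
// the floor/ceil bin rule quoted earlier in the thread.
void PoolingWindow(float kernel_h, float kernel_w,
                   float stride_h, float stride_w,
                   int height, int width, int ph, int pw,
                   int* hstart, int* hend, int* wstart, int* wend) {
  *hstart = static_cast<int>(std::floor(ph * stride_h));
  *wstart = static_cast<int>(std::floor(pw * stride_w));
  *hend = std::min(static_cast<int>(std::ceil(ph * stride_h + kernel_h)), height);
  *wend = std::min(static_cast<int>(std::ceil(pw * stride_w + kernel_w)), width);
}
```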
Classification accuracy on the VOC 2012 dataset:
The spatial pyramid pooling layer consists of four pyramid levels, each of which splits the images evenly into 1, 2, 3, and 6 patches along both the vertical and horizontal directions.
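For a sense of scale, the {1, 2, 3, 6} pyramid yields 1 + 4 + 9 + 36 = 50 bins per channel; the sketch below works out the resulting feature length, assuming the 256 conv5 channels of the Caffe reference ImageNet model (an assumption, not a figure quoted in this comment):

```cpp
#include <cstdio>

// Sketch: total fixed-length SPP output for pyramid levels {1, 2, 3, 6}.
// The 256 conv5 channels are assumed from the Caffe reference ImageNet model.
int main() {
  const int levels[] = {1, 2, 3, 6};
  const int channels = 256;
  int bins = 0;
  for (int n : levels) bins += n * n;  // 1 + 4 + 9 + 36 = 50
  std::printf("%d bins/channel, %d-d feature\n", bins, bins * channels);
  // 50 bins/channel, 12800-d feature, versus 6*6*256 = 9216 for plain pool5,
  // which is why the next comment mentions a larger fully connected layer.
  return 0;
}
```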
The SPP-net performed worse because the fully connected layer after the last convolution layer has larger dimensions than in the reference ImageNet model. Its parameters were randomly initialized and caused over-fitting on the relatively small VOC 2012 dataset. If it is first fine-tuned on a much larger dataset, its performance will certainly be superior, as described in the paper.
@@ -0,0 +1,328 @@
name: "ImagenetSpatialPyramidPoolingNet"
I am wondering: VOC 2012 classification has multiple labels per image, so how do you build the leveldb for it?
HDF5DataLayer
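For anyone following along, here is a minimal sketch of writing a multi-label HDF5 file that HDF5DataLayer can read, using the plain HDF5 C API; the dataset names must match the layer's top blob names, and the filename, shapes, and class count below are illustrative assumptions:

```cpp
#include <vector>
#include "hdf5.h"

// Sketch: write N images and their multi-hot VOC label vectors into one HDF5
// file. HDF5DataLayer reads one dataset per top blob, named after the tops
// (here "data" and "label"). Shapes and filename are illustrative.
int main() {
  const hsize_t num = 100, channels = 3, height = 227, width = 227;
  const hsize_t num_classes = 20;  // VOC has 20 object classes

  std::vector<float> data(num * channels * height * width, 0.f);
  std::vector<float> label(num * num_classes, 0.f);  // 1.0 where a class is present

  hid_t file = H5Fcreate("voc2012_train.h5", H5F_ACC_TRUNC,
                         H5P_DEFAULT, H5P_DEFAULT);

  hsize_t data_dims[4] = {num, channels, height, width};
  hid_t data_space = H5Screate_simple(4, data_dims, NULL);
  hid_t data_set = H5Dcreate2(file, "data", H5T_NATIVE_FLOAT, data_space,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(data_set, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT,
           data.data());

  hsize_t label_dims[2] = {num, num_classes};
  hid_t label_space = H5Screate_simple(2, label_dims, NULL);
  hid_t label_set = H5Dcreate2(file, "label", H5T_NATIVE_FLOAT, label_space,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(label_set, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT,
           label.data());

  H5Dclose(data_set); H5Sclose(data_space);
  H5Dclose(label_set); H5Sclose(label_space);
  H5Fclose(file);
  return 0;
}
```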
What's going on with this? Can I help?
This algorithm involves some very complicated corner cases. For example, a candidate region in the original image may be mapped into a very small region whose width or height is 1 pixel or less. It's very hard to detect objects whose sizes are small relative to the image. GoogLeNet combined with RCNN is a much more robust but much slower solution. In practice, you may find the object detectors included in the latest OpenCV quite handy for most use cases if you are required to quickly complete a project.
@kloudkl I'm interested in helping. Maybe we can chat about what is holding up this PR. How can we do that?
Closing since this PR is abandoned and the code is non-compositional. This is better achieved through layer composition. There is an expected replacement: spatial pyramid pooling has been given to a student as Caffe practice.
@shelhamer Can you please update us on the current status of this?
See #2177 for spatial pyramid pooling.
The spatial pyramid pooling layer [1] mentioned in #548 is a combination of the existing PoolingLayer and ConcatLayer. It automatically computes the sliding window sizes and strides for the multiple pyramid levels, applies the PoolingLayer on each level, and finally concatenates the outputs of all the levels into fixed-size vectors to feed into classifiers or fully connected layers.
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. The 13th European Conference on Computer Vision (ECCV), 2014
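As a rough illustration of the composition described in the PR summary above, the sketch below computes one (kernel, stride) pair per pyramid level, as would configure fixed-size PoolingLayers feeding a shared ConcatLayer; it is a simplified sketch, not the actual code from this PR:

```cpp
#include <cmath>
#include <vector>

// Sketch of the per-level parameter computation: each pyramid level with
// `bins` bins per side pools the conv5 map with its own kernel and stride,
// and the pooled outputs are then concatenated into one fixed-length vector.
struct LevelParams {
  int kernel_size;
  int stride;
};

std::vector<LevelParams> ComputeLevelParams(int feature_side,
                                            const std::vector<int>& pyramid) {
  std::vector<LevelParams> params;
  for (int bins : pyramid) {
    const float bin_size = static_cast<float>(feature_side) / bins;
    LevelParams p;
    p.kernel_size = static_cast<int>(std::ceil(bin_size));
    p.stride = static_cast<int>(std::floor(bin_size));
    params.push_back(p);
  }
  return params;
}

// Usage: ComputeLevelParams(13, {1, 2, 3, 6}) yields one (kernel, stride)
// pair per level, each configuring one PoolingLayer whose output feeds a
// ConcatLayer. As discussed earlier in the thread, this plain ceil/floor rule
// can over-produce bins for some sizes; the authors' floor/ceil bin-boundary
// rule avoids that.
```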