The implementation of ResNet is different from the official implementation in Caffe #191
From what I see, the torchvision implementation also uses 1x1 convolution kernels when downsampling; see here for an example.
This is only partially true (and the issue should not be closed). The downsample path is one of the convolutions that should have stride 2 (and it does, as you pointed out, @fmassa), but there are also the convolutions in the bottleneck block (which the original issue was referencing); see here. There, too, it is the first convolution (1x1) that should have stride=stride, not the second convolution (3x3).
Also, Table 1 in the paper says that "downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2." If I am not missing something in the code, the Bottleneck layer uses stride 2 in the second convolution instead of the first one (as pointed out by @lyuwenyu and @ptrendx). For instance, in the last convolutional group, the Bottleneck blocks follow this pattern:

```python
# First Bottleneck block of the group (current torchvision behaviour)
out = conv1_bn_relu(out, kernel=1, stride=1)
out = conv2_bn_relu(out, kernel=3, stride=2)
out = conv3_bn_relu(out, kernel=1, stride=1)
```

While it should be:

```python
# First Bottleneck block of the group (as in the paper / Caffe)
out = conv1_bn_relu(out, kernel=1, stride=2)
out = conv2_bn_relu(out, kernel=3, stride=1)
out = conv3_bn_relu(out, kernel=1, stride=1)

# Remaining Bottleneck blocks of the group (no downsampling)
out = conv1_bn_relu(out, kernel=1, stride=1)
out = conv2_bn_relu(out, kernel=3, stride=1)
out = conv3_bn_relu(out, kernel=1, stride=1)

out = conv1_bn_relu(out, kernel=1, stride=1)
out = conv2_bn_relu(out, kernel=3, stride=1)
out = conv3_bn_relu(out, kernel=1, stride=1)
```

If you paste the original prototxt in this network visualizer, you can see that in the last convolutional group only conv5_1 performs the downsampling with a stride of 2.

EDIT: clarity and corrected possible fix

I think it could be fixed by changing here to:

```python
self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, stride=stride, bias=False)
self.bn1 = nn.BatchNorm2d(planes)
self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
```
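To make the two placements concrete, here is a minimal runnable sketch (not torchvision's actual source; the `stride_on_1x1` flag is a name invented for this example) of a Bottleneck where the flag selects between the paper/Caffe placement and torchvision's placement:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a Bottleneck block. stride_on_1x1=True puts the stride on
    the first 1x1 conv (paper/Caffe, "v1"); False puts it on the 3x3 conv
    (torchvision's placement, often called "v1.5")."""
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, stride_on_1x1=True):
        super().__init__()
        s1, s2 = (stride, 1) if stride_on_1x1 else (1, stride)
        self.conv1 = nn.Conv2d(inplanes, planes, 1, stride=s1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=s2, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = None
        if stride != 1 or inplanes != planes * self.expansion:
            # Projection shortcut; its stride matches the block's overall stride.
            self.downsample = nn.Sequential(
                nn.Conv2d(inplanes, planes * self.expansion, 1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * self.expansion),
            )

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)

# Both variants give identical output shapes; only where the stride is applied differs.
x = torch.randn(1, 256, 14, 14)
print(Bottleneck(256, 128, stride=2, stride_on_1x1=True)(x).shape)   # [1, 512, 7, 7]
print(Bottleneck(256, 128, stride=2, stride_on_1x1=False)(x).shape)  # [1, 512, 7, 7]
```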
While I agree that the definition of the Bottleneck module seems to be different from the one in the original paper, I believe that what is currently done throws away much less information at the beginning of each block (at the expense of a smaller receptive field). Indeed, the original implementation seems to throw away 75% of the input of each residual at the beginning of each Bottleneck module (a 1x1 conv with a stride of 2 simply skips three out of every four spatial positions). I'm reopening the issue and tagging @colesbury (who originally implemented ResNet in PyTorch).
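A quick way to sanity-check the 75% figure (an illustrative experiment, not from the thread): backpropagate through a stride-2 1x1 convolution and count which input positions receive any gradient.

```python
import torch
import torch.nn as nn

# A 1x1 convolution with stride 2 only ever reads every other row/column
# of its input, so 3 out of 4 spatial positions are ignored entirely.
conv1x1 = nn.Conv2d(1, 1, kernel_size=1, stride=2, bias=False)
x = torch.randn(1, 1, 8, 8, requires_grad=True)
conv1x1(x).sum().backward()
print(f"fraction of input positions used: {(x.grad != 0).float().mean().item():.2f}")  # 0.25

# By contrast, a 3x3 convolution with stride 2 (padding 1) touches every
# input position, so no activation is discarded outright.
conv3x3 = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1, bias=False)
x2 = torch.randn(1, 1, 8, 8, requires_grad=True)
conv3x3(x2).sum().backward()
print(f"fraction used by 3x3/stride-2: {(x2.grad != 0).float().mean().item():.2f}")  # ~1.00
```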
That makes sense, @fmassa. A lot is being discarded in the original implementation. To add to this discussion, according to this user, Kaiming He wrote:
I actually tried to fine-tune both variations on my task (which possibly isn't the most suitable way to evaluate it, though), and they both gave similar results.
I'll try to summarise:

After all, some comments may be needed in the code.
@Dirtybluer a PR adding some comments to the resnet code would be great!
Perhaps we could have a (say) constructor flag for choosing where the downsampling stride goes. Of course, you could cook up a script yourself to hack a resnet instance and move the downsampling to the 1x1 convolution, but I think it would be better if everyone could rely on this being implemented consistently. What do you think? If the above sounds reasonable, I can throw together a PR.
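For reference, the "cook up a script yourself" route might look like the sketch below. `move_stride_to_1x1` is a hypothetical helper (not a torchvision API) that swaps the strides of `conv1` and `conv2` in place on every downsampling Bottleneck; output shapes are unaffected, though with pretrained weights the activations will of course change.

```python
import torchvision.models as models
from torchvision.models.resnet import Bottleneck

def move_stride_to_1x1(model):
    """Hypothetical helper: move the stride from the 3x3 conv (conv2) onto
    the 1x1 conv (conv1) in every Bottleneck, approximating Caffe/paper
    behaviour. Conv2d strides are plain attributes and can be reassigned."""
    for m in model.modules():
        if isinstance(m, Bottleneck) and m.conv2.stride == (2, 2):
            m.conv1.stride = (2, 2)
            m.conv2.stride = (1, 1)
    return model

model = move_stride_to_1x1(models.resnet50())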
Just FYI, training a ResNet34 model on CIFAR10 gives much worse performance when done with torchvision's version:
I was struggling to reproduce CIFAR10 results, as I assumed the performance should be similar between the two repos.
@chledowski The input image size of CIFAR10 is much smaller than ImageNet's. I guess you can prune off one layer of TorchVision's model to get results similar to Kuang Liu's; that seems to be the trick behind Liu's repo.
Thanks for the info! You're right: I just read that the first CNN layer in torchvision has a kernel of size 7, stride 2, and padding 3, while Kuang Liu uses kernel 3, stride 1 and, I think, no padding.
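For anyone hitting the same gap, a common CIFAR-style adaptation (a sketch of the well-known workaround, not part of torchvision's API) is to swap the ImageNet stem for a 3x3/stride-1 convolution and drop the initial max-pool, so the 32x32 input is not downsampled 4x before the residual stages:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Sketch: adapt torchvision's ResNet34 to CIFAR10-sized (32x32) inputs.
model = models.resnet34(num_classes=10)
# Replace the 7x7/stride-2 ImageNet stem with a 3x3/stride-1 conv ...
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
# ... and remove the stride-2 max-pool, keeping full resolution until
# the residual stages downsample it themselves.
model.maxpool = nn.Identity()

out = model(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 10])
```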
Same here on CIFAR100. It makes me frustrated with hyperparameter tuning, and I can't find where the problem is. The cyan curve is the implementation from https://github.com/weiaicunzai/pytorch-cifar100, and the pink curve is PyTorch's implementation.
The downsample part in each block/layer (not the skip-connection part): PyTorch does it in the conv3x3 with stride=2, but the official Caffe version does it in the conv1x1 with stride=2. Here in Bottleneck the stride sits on the 3x3 convolution, but in Caffe it sits on the first 1x1 convolution.
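For concreteness, the contrast the issue points at looks roughly like this (a paraphrase of both code bases in PyTorch terms, not a verbatim quote of either):

```python
import torch.nn as nn

inplanes, planes, stride = 256, 128, 2

# torchvision-style Bottleneck (at the time of this issue):
# the stride is applied by the 3x3 convolution.
conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)

# Caffe/paper-style Bottleneck:
# the stride is applied by the first 1x1 convolution.
conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, stride=stride, bias=False)
conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
```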