
The implementation of ResNet is different from official implementation in Caffe #191

Open
lyuwenyu opened this issue Jun 26, 2017 · 12 comments


@lyuwenyu

lyuwenyu commented Jun 26, 2017

The downsample part in each block/stage (not the skip-connection part): PyTorch does it in the 3x3 conv using stride=2, but the official Caffe version does it in the first 1x1 conv with stride=2.

conv1x1 <- Caffe does it here
conv3x3 <- PyTorch does it here
conv1x1

Here in Bottleneck:

        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)

  (layer2): Sequential (
    (0): Bottleneck (
      (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
       ...

But in Caffe:


layer {
	bottom: "res2c"
	top: "res3a_branch2a"
	name: "res3a_branch2a"
	type: "Convolution"
	convolution_param {
		num_output: 128
		kernel_size: 1
		pad: 0
		stride: 2
		bias_term: false
	}
}
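For reference, a minimal PyTorch sketch of the two stride placements in the residual branch (BN and ReLU omitted for brevity; channel sizes taken from the layer2 example above):

import torch.nn as nn

# torchvision: stride in the 3x3 conv
torchvision_branch = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1, stride=1, bias=False),
    nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=1, bias=False),
    nn.Conv2d(128, 512, kernel_size=1, stride=1, bias=False),
)

# official Caffe prototxt: stride in the first 1x1 conv
caffe_branch = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1, stride=2, bias=False),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.Conv2d(128, 512, kernel_size=1, stride=1, bias=False),
)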
@fmassa
Member

fmassa commented Sep 3, 2017

From what I see, the torchvision implementation also uses 1x1 convolution kernels when downsampling; see here for example.

@fmassa fmassa closed this as completed Sep 3, 2017
@ptrendx

ptrendx commented Oct 24, 2017

This is only partially true (and the issue should not be closed). Downsample is one of the convolutions that should have stride 2 (and it does, as you pointed out, @fmassa), but there are also the convolutions in the bottleneck block (which the original issue was referencing) - see here. There, too, it is the first convolution (1x1) that should have stride=stride, not the second convolution (3x3).

@victorhcm

victorhcm commented Oct 30, 2017

Also, Table 1 in the paper states that "downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2." If I am not missing something in the code, the Bottleneck layer uses stride 2 in the second convolution instead of the first (as pointed out by @lyuwenyu and @ptrendx). For instance, in the last convolutional group we have Bottlenecks following this pattern:

# first Bottleneck of the group (current torchvision behavior)
out = conv1_bn_relu(out, kernel=1, stride=1)
out = conv2_bn_relu(out, kernel=3, stride=2)
out = conv3_bn_relu(out, kernel=1, stride=1)

While it should be:

# first Bottleneck: downsampling in the first 1x1 conv, as in the paper
out = conv1_bn_relu(out, kernel=1, stride=2)
out = conv2_bn_relu(out, kernel=3, stride=1)
out = conv3_bn_relu(out, kernel=1, stride=1)

# second Bottleneck: no downsampling
out = conv1_bn_relu(out, kernel=1, stride=1)
out = conv2_bn_relu(out, kernel=3, stride=1)
out = conv3_bn_relu(out, kernel=1, stride=1)

# third Bottleneck: no downsampling
out = conv1_bn_relu(out, kernel=1, stride=1)
out = conv2_bn_relu(out, kernel=3, stride=1)
out = conv3_bn_relu(out, kernel=1, stride=1)

If you paste the original prototxt into this network visualizer, you can see that in the last convolutional group only conv5_1 (res5a_branch2a) has stride 2; the following layers have stride 1.

EDIT: edited for clarity and corrected the possible fix

I think it could be fixed by changing here to:

self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, stride=stride, bias=False)
self.bn1 = nn.BatchNorm2d(planes)
self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1,
                       padding=1, bias=False)

@fmassa
Member

fmassa commented Nov 12, 2017

While I agree that the definition of the Bottleneck module seems to be different from the one mentioned in the original paper, I believe that what is currently done throws away much less information at the beginning of each block (at the expense of a smaller receptive field). Indeed, the original implementation seems to throw away 75% of the input of each residual branch at the beginning of each Bottleneck module: a 1x1 conv with stride 2 reads only every other position along each spatial dimension, so three out of every four input positions never contribute to the output.
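A minimal sketch that verifies this, by checking via gradients which input positions can influence the output:

import torch
import torch.nn as nn

def fraction_of_input_used(conv, size=8):
    # input positions with a nonzero gradient are the ones that can
    # influence the convolution's output
    x = torch.zeros(1, 1, size, size, requires_grad=True)
    conv(x).sum().backward()
    return (x.grad != 0).float().mean().item()

# 1x1 conv, stride 2: only every other row/column is read -> 0.25
print(fraction_of_input_used(nn.Conv2d(1, 1, kernel_size=1, stride=2, bias=False)))
# 3x3 conv, stride 2: overlapping windows cover every position -> 1.0
print(fraction_of_input_used(nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1, bias=False)))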

Note that the BasicBlock architecture follows the right pattern.

I'm reopening the issue and tagging @colesbury (who originally implemented ResNet in PyTorch).
To summarize, the original paper places the downsampling here, while we are doing it here. The same pattern is present in fb.resnet.torch.

@fmassa fmassa reopened this Nov 12, 2017
@victorhcm

victorhcm commented Nov 20, 2017

That makes sense, @fmassa. A lot is being discarded in the original implementation.

To add to this discussion, according to this user, Kaiming He wrote:

In all experiments in the paper, the stride=2 operation is in the first 1x1 conv layer when downsampling. This might not be the best choice, as it wastes some computations of the preceding block. For example, using stride=2 in the first 1x1 conv in the first block of conv3 is equivalent to using stride=2 in the 3x3 conv in the last block of conv2. So I feel applying stride=2 to either the first 1x1 or the 3x3 conv should work. I just kept it “as is”, because we do not have enough resources to investigate every choice.

I actually tried to fine-tune both variations to my task (which possibly isn't the most suitable way to evaluate it, though), and they both gave similar results.

@Dirtybluer
Contributor

Let me try to summarise:

  1. The implementation of ResNet in PyTorch does differ from the one in Kaiming He's original paper: it transfers the responsibility for downsampling from the first 1x1 convolutional layer to the 3x3 convolutional layer in Bottleneck.
  2. This variation is also known as "ResNet V1.5", as mentioned in resnet50 defined in torchvision/models/resnet.py is v1.5? #1266, and the name seems to have been coined by NVIDIA according to Question about the name of "ResNet v1.5" NVIDIA/DeepLearningExamples#419 (comment).
  3. The effect of this modification in practice has been pointed out here by NVIDIA:

This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).

  4. It may be unnecessary to change it back to the original implementation, since the differences are negligible (accuracy actually increases). Besides, changing it would affect the performance of previously released pre-trained models.

All in all, a comment in resnet.py explaining this situation may be enough to close this issue and prevent similar issues in the future (a possible wording is sketched below). What do you think, @fmassa? If needed, I can open a PR for it.
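For illustration, such a comment could read something like this (a sketch, not final wording; the numbers come from NVIDIA's note quoted above):

# NOTE: this Bottleneck places the stride for downsampling in the 3x3 convolution
# (self.conv2), while the original paper ("Deep Residual Learning for Image
# Recognition", https://arxiv.org/abs/1512.03385) places it in the first 1x1
# convolution (self.conv1). This variant is known as ResNet V1.5 and, per NVIDIA,
# is slightly more accurate (~0.5% top-1) at a small throughput cost (~5% imgs/sec).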

@fmassa
Member

fmassa commented Mar 13, 2020

@Dirtybluer a PR adding some comments to the resnet code would be great!

@alegonz

alegonz commented May 27, 2020

Perhaps we could have a (say) v1_downsampling=False argument to choose the v1 implementation? This would be particularly useful if you want to reproduce, as closely as possible, a paper that uses a v1 ResNet backbone for something.

Of course, you could cook up a script yourself to patch a resnet instance and move the downsampling to the 1x1 convolution, but I think it would be better if everyone could rely on this being implemented consistently. A sketch of how the argument might look is below.
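A minimal sketch, assuming a hypothetical v1_downsampling flag (not part of torchvision's API); the BN/ReLU wiring follows torchvision's Bottleneck:

import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None,
                 v1_downsampling=False):
        super().__init__()
        # v1 puts the stride in the first 1x1 conv; v1.5 puts it in the 3x3 conv
        stride_1x1, stride_3x3 = (stride, 1) if v1_downsampling else (1, stride)
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1,
                               stride=stride_1x1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=stride_3x3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion,
                               kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        return self.relu(out + identity)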

What do you think? If the above sounds reasonable, I can throw a PR.

@chledowski

Just FYI, training a ResNet34 model on CIFAR10 gives much worse performance when done with torchvision's version:

[image: CIFAR10 training curves comparing the two implementations]

I was struggling to reproduce CIFAR10 results as I assumed the performance should be similar between the two repos.

@zhiqwang
Contributor

zhiqwang commented Sep 18, 2022

Just FYI, training a ResNet34 model on CIFAR10 gives much worse performance when done with torchvision's version:

@chledowski The input image size of CIFAR10 is much smaller than ImageNet's. I guess you could prune off one layer of TorchVision's model to get results similar to Kuang Liu's; that seems to be the trick behind Liu's repo.

@chledowski

Thanks for the info! You're right. I just read that the first conv layer in torchvision has a kernel of size 7, stride 2, and padding 3, while Kuang Liu uses kernel 3, stride 1, and no padding, I think.
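For anyone running into this, here is a sketch of the common CIFAR-style stem adaptation (not an official torchvision option; padding=1 is assumed here so 32x32 resolution is preserved):

import torch.nn as nn
from torchvision.models import resnet34

model = resnet34(num_classes=10)
# replace the ImageNet stem (7x7 stride-2 conv followed by a max pool) with a
# 3x3 stride-1 conv and no pooling, so 32x32 inputs are not downsampled 4x
# before the first residual block
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
model.maxpool = nn.Identity()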

@youyinnn

youyinnn commented Jan 21, 2024

Just FYI, training a ResNet34 model on CIFAR10 gives much worse performance when done with torchvision's version:


I was struggling to reproduce CIFAR10 results as I assumed the performance should be similar between the two repos.

Same here on CIFAR100. It made hyperparameter tuning frustrating, and I couldn't find where the problem was.

[image: CIFAR100 training curves for the two implementations]

The cyan one is the implementation from https://github.com/weiaicunzai/pytorch-cifar100, and the pink one is PyTorch's implementation.
