The document of torchvision.ops.deform_conv2d is not clear #3673
Comments
@Zhaoyi-Yan Perhaps the CPU source code can lend you some insight. I'd say your guess seems reasonable, but I don't remember many details of DCN, so I wouldn't be sure.
After reading the source code you referred to, it seems reasonable. However, it would be better to have a detailed note about the offset for users.
Maybe some comments should be added for this. What do you think? @NicolasHug
Sure, any PR to improve the docs would be very welcome!
I'm not an expert on DCN right now... So maybe you'd like to send a PR for this? @Zhaoyi-Yan
I am not either...
@NicolasHug @Zhaoyi-Yan I'll try to send a PR to clarify the docs after re-reading the paper, if no one more familiar with DCN turns up.
@Zhaoyi-Yan I've sent a PR for this. I now believe your initial guess is correct, if you consider the height direction as x and the width direction as y.
It would also be very important to know the order of the elements in the offset tensor. From the docs: offset (Tensor[batch_size, 2 * offset_groups * kernel_height * kernel_width, out_height, out_width]) – offsets to be applied for each position in the convolution kernel. That is, what is the arrangement along the 2 * offset_groups * kernel_height * kernel_width dimension — is it in this particular order? Considering the comments here, I think it could be made a lot clearer by passing a structured tensor instead of a flattened array: (offset_groups x kernel_height x kernel_width x 2).
It is very confusing indeed. You could check out this ongoing PR for some clarification (I tried, but the explanation there is still not very clear...).
I think that could introduce a BC break of some sort? Personally, I think that if deformable conv could be implemented as a PyTorch layer, things would be much easier...
I am not a developer, but I think this might be handled with a fixed internal flatten operation that could accept both inputs? Personally, I think stating the exact order of the elements encoded in the "2 * offset_groups * kernel_height * kernel_width" dimension in the docs would be sufficient; I like the functional approach of the current version. Assuming a tensor T of shape offset_groups x kernel_height x kernel_width x [offset_h, offset_w], the "flattened tensor" to pass to the function would be: [T[0,0,0,0], T[0,0,0,1], T[0,0,1,0], T[0,0,1,1], ...]. If this assumption is correct, the docs should state it for clarity.
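If that assumption holds, the element order along the offset channel axis can be sketched with plain index arithmetic (no torch needed). Note that the formula below is an assumption inferred from this thread — it simply mirrors flattening a (offset_groups, kh, kw, 2) tensor — not something the torchvision docs currently state:

```python
# Pure-Python sketch of the ASSUMED offset channel layout:
# channel index = ((g * kh + i) * kw + j) * 2 + c, where c == 0 is the
# height (h) offset component and c == 1 is the width (w) component.

def offset_channel(g, i, j, c, kh, kw):
    """Channel index along the 2*offset_groups*kh*kw axis for offset
    group g, kernel row i, kernel column j, component c (0=h, 1=w)."""
    return ((g * kh + i) * kw + j) * 2 + c

# With offset_groups = 1 and a 3x3 kernel there are 2*1*3*3 = 18 channels.
kh = kw = 3
channels = [offset_channel(0, i, j, c, kh, kw)
            for i in range(kh) for j in range(kw) for c in range(2)]
print(channels)  # enumerating (i, j, c) in nested order walks 0..17
```

Under this assumption, building the offset as a (offset_groups, kh, kw, 2) tensor and calling `.reshape(-1)` on it produces exactly the channel order above.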
Maybe this demo will help us understand the role of the offset.

```python
import torch
from torchvision.ops import deform_conv2d

h = w = 3
# batch_size, num_channels, out_height, out_width
x = torch.arange(h * w * 3, dtype=torch.float32).reshape(1, 3, h, w)
# to show the effect of the offset more intuitively, only the case of kh=kw=1 is considered here
offset = torch.FloatTensor(
    [  # create our predefined offset with offset_groups = 3
        0,
        -1,  # sample the left pixel of the centroid pixel
        0,
        1,  # sample the right pixel of the centroid pixel
        -1,
        0,  # sample the top pixel of the centroid pixel
    ]  # here, we divide the input channels into offset_groups groups with different offsets
).reshape(1, 2 * 3 * 1 * 1, 1, 1)
# here we use the same offset for each local neighborhood in the single channel,
# so we repeat the offset over the whole space: batch_size, 2 * offset_groups * kh * kw, out_height, out_width
offset = offset.repeat(1, 1, h, w)
weight = torch.FloatTensor(
    [
        [1, 0, 0],  # only extract the first channel of the input tensor
        [0, 1, 0],  # only extract the second channel of the input tensor
        [1, 1, 0],  # add the first and the second channels of the input tensor
        [0, 0, 1],  # only extract the third channel of the input tensor
        [0, 1, 0],  # only extract the second channel of the input tensor
    ]
).reshape(5, 3, 1, 1)
deconv_shift = deform_conv2d(x, offset=offset, weight=weight)
print(deconv_shift)
"""
tensor([[[[ 0.,  0.,  1.],    # offset=(0, -1): the first channel of the input tensor
          [ 0.,  3.,  4.],    # output hw indices (1, 2) => (1, 2-1) => input indices (1, 1)
          [ 0.,  6.,  7.]],   # output hw indices (2, 1) => (2, 1-1) => input indices (2, 0)
         [[10., 11.,  0.],    # offset=(0, 1): the second channel of the input tensor
          [13., 14.,  0.],    # output hw indices (1, 1) => (1, 1+1) => input indices (1, 2)
          [16., 17.,  0.]],   # output hw indices (2, 0) => (2, 0+1) => input indices (2, 1)
         [[10., 11.,  1.],    # offset=[(0, -1), (0, 1)]: accumulate the first and second channels after sampling with an offset
          [13., 17.,  4.],
          [16., 23.,  7.]],
         [[ 0.,  0.,  0.],    # offset=(-1, 0): the third channel of the input tensor
          [18., 19., 20.],    # output hw indices (1, 1) => (1-1, 1) => input indices (0, 1)
          [21., 22., 23.]],   # output hw indices (2, 2) => (2-1, 2) => input indices (1, 2)
         [[10., 11.,  0.],    # offset=(0, 1): the second channel of the input tensor
          [13., 14.,  0.],    # output hw indices (1, 1) => (1, 1+1) => input indices (1, 2)
          [16., 17.,  0.]]]]) # output hw indices (2, 0) => (2, 0+1) => input indices (2, 1)
"""
```
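For readers without torchvision at hand, the sampling rule the comments above describe can be reproduced in plain Python for the first output channel (weight row [1, 0, 0], offset (0, -1)). This is a simplified sketch that assumes integer offsets and zero padding outside the image; the real op uses bilinear interpolation, which reduces to this direct lookup when the offsets are integers:

```python
# Dependency-free sketch of the kh=kw=1 sampling in the demo above:
# each output pixel reads the input at (h + dh, w + dw), with zeros
# outside the image (assumed padding behavior for this illustration).

h = w = 3
ch0 = [[r * w + c for c in range(w)] for r in range(h)]  # x[0, 0] = 0..8

def sample(img, y, x_):
    """Return img[y][x_], or 0 if the position falls outside the image."""
    if 0 <= y < len(img) and 0 <= x_ < len(img[0]):
        return img[y][x_]
    return 0

dh, dw = 0, -1  # first offset pair in the demo: sample the left pixel
out = [[sample(ch0, r + dh, c + dw) for c in range(w)] for r in range(h)]
print(out)  # [[0, 0, 1], [0, 3, 4], [0, 6, 7]] — matches the first channel
```

The result matches the first channel of the printed tensor in the demo, which supports reading the offset pairs as (offset_h, offset_w) added to the sampling position.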
📚 Documentation
From the documentation, I cannot work out the exact meaning of the 18 (i.e., 2*3*3) channels of the offset in a deformable convolution.
I want to visualize the offsets of a deformable convolution with kernel size 3x3.
So it's essential for me to know the exact meaning of these channels.
I have written down a possible interpretation here: