self.g = nn.Sequential(self.g, max_pool(kernel_size=2))
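In the paper, the pooling is applied only in the spatial domain. I think the kernel size is not set correctly when the input has a temporal dimension: with a 3D max pool, `kernel_size=2` also pools along the temporal axis, whereas something like `kernel_size=(1, 2, 2)` would restrict the pooling to height and width. For comparison, the reference implementation: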
https://github.com/facebookresearch/video-nonlocal-net/blob/b273c446e8e10dbaec266520e4005d27d7052125/lib/models/nonlocal_helper.py#L40
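
To illustrate the difference, here is a minimal sketch that assumes `max_pool` is `nn.MaxPool3d` for video input and that the tensor layout is `(N, C, T, H, W)`. The tensor sizes and variable names are made up for the example and are not taken from the repository.

```python
import torch
from torch import nn

# Hypothetical video feature map: batch=1, channels=4, T=8, H=16, W=16.
x = torch.randn(1, 4, 8, 16, 16)

# kernel_size=2 expands to (2, 2, 2), so the temporal axis is pooled as well.
pool_all = nn.MaxPool3d(kernel_size=2)

# kernel_size=(1, 2, 2) leaves the temporal axis untouched (spatial-only pooling).
pool_spatial = nn.MaxPool3d(kernel_size=(1, 2, 2))

shape_all = tuple(pool_all(x).shape)          # (1, 4, 4, 8, 8): T halved
shape_spatial = tuple(pool_spatial(x).shape)  # (1, 4, 8, 8, 8): T preserved
print(shape_all, shape_spatial)
```

If spatial-only pooling is what is intended, passing the tuple kernel size for the 3D case (and keeping `kernel_size=2` for the 2D case) looks like the smallest change.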