https://github.com/AlexHex7/Non-local_pytorch/blob/589dde8cc048b2e963cdd3e06f96f7ef20130dcf/lib/non_local_simple_version.py#L59

In the paper, the pooling is applied only in the spatial domain. I think the kernel size here is not set correctly when the input has a temporal dimension: the pooling should leave the temporal axis untouched. Compare with the facebookresearch implementation:

https://github.com/facebookresearch/video-nonlocal-net/blob/b273c446e8e10dbaec266520e4005d27d7052125/lib/models/nonlocal_helper.py#L40
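
To illustrate the difference, here is a minimal pure-Python sketch of the output shape a max pool would produce, assuming stride equal to the kernel size and no padding (the PyTorch defaults for `nn.MaxPool3d`). The helper `pooled_shape` and the `(8, 14, 14)` feature-map size are hypothetical, chosen only for illustration; they are not taken from either repository.

```python
def pooled_shape(shape, kernel_size):
    """Output (T, H, W) of a max pool with stride == kernel_size and no padding."""
    if isinstance(kernel_size, int):
        # A scalar kernel size is applied to every axis, including time.
        kernel_size = (kernel_size,) * len(shape)
    return tuple((dim - k) // k + 1 for dim, k in zip(shape, kernel_size))


clip = (8, 14, 14)  # hypothetical (T, H, W) feature map

# Scalar kernel: pools T, H and W, like nn.MaxPool3d(kernel_size=2).
pooled_all = pooled_shape(clip, 2)

# Per-axis kernel: pools only H and W, like nn.MaxPool3d(kernel_size=(1, 2, 2)).
pooled_spatial = pooled_shape(clip, (1, 2, 2))

print(pooled_all)      # (4, 7, 7)  temporal length halved
print(pooled_spatial)  # (8, 7, 7)  temporal length preserved
```

With a scalar kernel size the temporal length is halved as well, which is what I believe differs from the spatial-only pooling described in the paper.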