Hello, authors!
I found what looks like a small mismatch between the paper and the released implementation.
The paper states that the two Updated Agg Tokens for video and audio are fused by linear aggregation, but it does not explain how the weight matrix for the two features is computed.
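For reference, this is roughly how I pictured that linear aggregation. It is only my own sketch; the variable names, the shapes, and the learned projection W are assumptions on my part, not taken from the paper or the repository.

import torch
import torch.nn as nn

# My own sketch of "linear aggregation" of the two Updated Agg Tokens.
# The shapes and the projection W below are assumptions, not from the repo.
B, C = 8, 256                                 # batch size and token dimension (assumed)
agg_v = torch.randn(B, C)                     # updated video Agg Token (assumed shape)
agg_a = torch.randn(B, C)                     # updated audio Agg Token (assumed shape)

W = nn.Linear(2 * C, C)                       # learned weight matrix for the fusion (assumed)
fused = W(torch.cat([agg_v, agg_a], dim=-1))  # (B, C)
print(fused.shape)                            # torch.Size([8, 256])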
In the code, however, the two features are simply expanded and then concatenated along the channel dimension. The code is as follows:
xva = torch.cat((
    xv.unsqueeze(2).repeat(1, 1, na, 1),
    xa.unsqueeze(1).repeat(1, nv, 1, 1),
), dim=3).flatten(1, 2)
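To make my question concrete, here is a small shape trace of the snippet above with dummy tensors; the concrete values of B, nv, na, and C are just assumptions for illustration.

import torch

# Dummy shapes chosen only to trace what the concatenation does (values assumed).
B, nv, na, C = 2, 4, 6, 256
xv = torch.randn(B, nv, C)  # video features (assumed shape)
xa = torch.randn(B, na, C)  # audio features (assumed shape)

xva = torch.cat((
    xv.unsqueeze(2).repeat(1, 1, na, 1),  # (B, nv, na, C)
    xa.unsqueeze(1).repeat(1, nv, 1, 1),  # (B, nv, na, C)
), dim=3).flatten(1, 2)                   # (B, nv * na, 2 * C)

print(xva.shape)  # torch.Size([2, 24, 512])

If I read this right, every (video, audio) token pair is just concatenated channel-wise, and no learned weight matrix is applied at this step.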
So how should I understand this part of the paper versus the code? Looking forward to your answer!