Hello, authors!
I found what looks like a small mismatch between the paper and the released implementation.
The paper states that the two Updated Agg Tokens for video and audio are fused by linear aggregation, but it does not explain how the weight matrix for the two features is computed.
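For reference, this is roughly how I pictured that linear aggregation. It is only my own sketch; the variable names, the shapes, and the learned projection W are assumptions on my part, not taken from the paper or the repository.

import torch
import torch.nn as nn

# My own sketch of "linear aggregation" of the two Updated Agg Tokens.
# The shapes and the projection W below are assumptions, not from the repo.
B, C = 8, 256                                 # batch size and token dimension (assumed)
agg_v = torch.randn(B, C)                     # updated video Agg Token (assumed shape)
agg_a = torch.randn(B, C)                     # updated audio Agg Token (assumed shape)

W = nn.Linear(2 * C, C)                       # learned weight matrix for the fusion (assumed)
fused = W(torch.cat([agg_v, agg_a], dim=-1))  # (B, C)
print(fused.shape)                            # torch.Size([8, 256])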
In the code, however, the two features are simply expanded and then concatenated along the channel dimension. The code is as follows:
xva = torch.cat((
    xv.unsqueeze(2).repeat(1, 1, na, 1),
    xa.unsqueeze(1).repeat(1, nv, 1, 1),
), dim=3).flatten(1, 2)
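To make my question concrete, here is a small shape trace of the snippet above with dummy tensors; the concrete values of B, nv, na, and C are just assumptions for illustration.

import torch

# Dummy shapes chosen only to trace what the concatenation does (values assumed).
B, nv, na, C = 2, 4, 6, 256
xv = torch.randn(B, nv, C)  # video features (assumed shape)
xa = torch.randn(B, na, C)  # audio features (assumed shape)

xva = torch.cat((
    xv.unsqueeze(2).repeat(1, 1, na, 1),  # (B, nv, na, C)
    xa.unsqueeze(1).repeat(1, nv, 1, 1),  # (B, nv, na, C)
), dim=3).flatten(1, 2)                   # (B, nv * na, 2 * C)

print(xva.shape)  # torch.Size([2, 24, 512])

If I read this right, every (video, audio) token pair is just concatenated channel-wise, and no learned weight matrix is applied at this step.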
So how should I understand this part of the paper versus the code? Looking forward to your answer!