difference between paper and implementation in gradcam calculation #789

@dengmengjie

Hi, thank you for your wonderful work.

I've noticed that in the paper, the relevance score between image patches and tokens is calculated as:
[Image: relevance-score equation from the paper]
where the positive values of the gradients are set to 0 by the min function, leaving only the negative values. The paper gives this reason:

Inspired by GradCAM, we filter out uninformative attention scores by multiplication with the gradient which could cause an increase in the image-text similarity.

But in your code implementation, clamp(0) is applied to the gradients, which sets the negative values to 0 and keeps the positive ones. Isn't that actually a max function rather than min?
grads = ( grads[:, :, :, 1:].clamp(0).reshape(visual_input.size(0), 12, -1, 24, 24) * mask )
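To make the question concrete, here is a minimal sketch of the two operations on made-up gradient values. It uses plain Python lists rather than tensors; `clamp_min` and `min_with_zero` are hypothetical helpers written for this illustration, not functions from the repository. The only thing taken from PyTorch is that the first positional argument of `clamp` is the lower bound, so `x.clamp(0)` means `x.clamp(min=0)`.

```python
# Hypothetical stand-ins operating on plain lists. In PyTorch, x.clamp(0)
# passes 0 as the `min` argument, i.e. the lower bound of the result.

def clamp_min(values, lower):
    """Raise every value below `lower` up to `lower`, like x.clamp(min=lower)."""
    return [max(v, lower) for v in values]

def min_with_zero(values):
    """Elementwise min(v, 0): zero out positives, keep negatives."""
    return [min(v, 0.0) for v in values]

grads = [-2.0, -0.5, 0.0, 0.5, 2.0]  # made-up gradient values

kept_by_clamp = clamp_min(grads, 0.0)  # what the code line above does
kept_by_min = min_with_zero(grads)     # what the min in the paper would do

print(kept_by_clamp)  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(kept_by_min)    # [-2.0, -0.5, 0.0, 0.0, 0.0]
```

So clamp(0) behaves like max(0, x) and keeps exactly the values that the min formulation would discard, which is the mismatch I am asking about.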

Could anyone provide an explanation? Thanks a lot!
