difference between paper and implementation in gradcam calculation

Hi, thank you for your wonderful work.

I've noticed that in the paper, the relevance score between image patches and tokens is calculated as:
![Image](https://github.com/user-attachments/assets/3928b0d1-9834-4263-a592-9f0529890406)
where the positive values of the gradients are set to 0 by the **min** function, leaving only the negative values. The paper gives this reason:
> Inspired by GradCAM, we filter out uninformative attention scores by multiplication with the gradient which could cause an increase in the image-text similarity.

But in the [code implementation](https://github.com/salesforce/LAVIS/blob/506965b9c4a18c1e565bd32acaccabe0198433f7/lavis/models/blip_models/blip_image_text_matching.py#L177), `clamp(0)` is applied to the gradients, which sets the negative values to 0 and keeps only the positive ones. Isn't that actually a **max** function instead of **min**?
```python
grads = (
    grads[:, :, :, 1:].clamp(0).reshape(visual_input.size(0), 12, -1, 24, 24)
    * mask
)
```
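To make the question concrete, here is a minimal sketch in plain Python rather than PyTorch. It assumes that `clamp(0)` passes `0` as the lower bound (`min=0`), which is my understanding of `torch.Tensor.clamp`. The helper names and the gradient values are made up for illustration and are not part of LAVIS.

```python
# Pure-Python stand-ins (hypothetical helpers, not LAVIS or PyTorch code).

def clamp_lower(values, lower=0.0):
    """Mimic tensor.clamp(lower): raise every entry below `lower` up to `lower`."""
    return [max(v, lower) for v in values]


def min_with_zero(values):
    """Element-wise min(0, v), which is how I read the formula in the paper."""
    return [min(v, 0.0) for v in values]


grads = [-2.0, -0.5, 0.0, 0.5, 2.0]  # made-up gradient values

clamped = clamp_lower(grads)      # behaviour I expect from the code
paper_min = min_with_zero(grads)  # behaviour I read from the paper

print(clamped)    # [0.0, 0.0, 0.0, 0.5, 2.0]
print(paper_min)  # [-2.0, -0.5, 0.0, 0.0, 0.0]
```

If that assumption holds, the code keeps exactly the entries that the formula in the paper zeroes out, and vice versa.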

Could anyone provide an explanation? Thanks a lot!

difference between paper and implementation in gradcam calculation #789