This repo contains the code for the CLIP Explainability project. In this project, we conduct an in-depth study of CLIP’s learned image and text representations using saliency map visualization. We propose a modification to the existing saliency visualization method that improves its performance, as shown by our qualitative evaluations. We then use this method to study CLIP’s ability to capture similarities and dissimilarities between an input image and targets belonging to different domains, including image, text, and emotion.
To install the required libraries, run the following command:
pip install -r requirements.txt
The code directory contains:
- Implementations of the saliency visualization methods for ViT-based (code/vit_cam.py) and ResNet-based (code/rn_cam.py) CLIP
- A GradCAM implementation based on pytorch-grad-cam, slightly modified to adapt it to CLIP
- A re-implementation of CLIP, taken from the Transformer-MM-Explainability repo, that keeps track of attention maps and gradients: clip_.py
- Notebooks for the experiments explained in the report
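
For readers unfamiliar with GradCAM, the standard weighting step can be sketched as follows. This is a generic, dependency-free illustration and not the code in this repo: the function name `grad_cam` and the toy numbers are made up, and it assumes the feature maps and the gradients of a target score (for CLIP, presumably an image-target similarity) have already been extracted.

```python
def grad_cam(activations, gradients):
    """Generic Grad-CAM map from K feature maps and their gradients (each H x W)."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: spatial average of each channel's gradient.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]
    # Weighted sum over channels, followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Scale to [0, 1] for display; an all-zero map is left untouched.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy input: 2 channels on a 2 x 2 grid (made-up numbers).
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],      # mean gradient  1.0
    [[-1.0, -1.0], [-1.0, -1.0]],  # mean gradient -1.0
]
saliency = grad_cam(activations, gradients)
```

Plain lists are used only to keep the sketch self-contained; in practice the same arithmetic would run on PyTorch tensors.
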
The Images directory contains the images used in the experiments.
The results directory contains the results obtained from the experiments. Any result generated by the notebooks will be stored in this directory.
Notebook Name | Experiment | Note |
---|---|---|
vit_block_vis | Layer-wise Attention Visualization | - |
saliency_method_compare | ViT Explainability Method Comparison | Qualitative comparison |
affectnet_emotions | ViT Explainability Method Comparison | Bias comparison. You need to download a sample of the AffectNet dataset here and place it in Images. |
pos_neg_vis | Positive vs Negative Saliency | - |
artemis_emotions | Emotion-Image Similarity | You need to download the pre-processed WikiArt images here and place them in Images. Note that this notebook chooses images randomly, so the results may differ from those in the report. |
perword_vis | Word-Wise Saliency Visualization | - |
global_vis | - | Can be used to visualize saliency maps for ViT- and ResNet-based CLIP. |
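
Because artemis_emotions chooses images randomly, repeated runs can show different images. One possible way to make a selection repeatable is to fix a seed, sketched below. This is an assumption-laden illustration: the notebook's actual sampling code is not shown here, and `sample_images` and the file names are hypothetical.

```python
import random


def sample_images(image_names, k, seed=None):
    """Pick k image names; the same seed gives the same selection."""
    # A private generator avoids changing the global random state.
    rng = random.Random(seed)
    # Sorting first makes the result independent of the input order.
    return rng.sample(sorted(image_names), k)


# Hypothetical file names standing in for the downloaded images.
names = [f"img_{i}.jpg" for i in range(20)]
first = sample_images(names, 5, seed=0)
second = sample_images(names, 5, seed=0)
```

With `seed=None` the selection varies between runs, which matches the behavior described in the note for artemis_emotions.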