This repo contains the code and pre-trained models that we released along with our paper:
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, and Rada Mihalcea.
TL;DR: CLoVe is a framework that significantly improves the ability of existing CLIP-like models to encode compositional language while maintaining or improving their performance on standard vision-language tasks.
This codebase is largely based on OpenCLIP's, containing its changes up to (and including) commit 73fa7f0. We are especially thankful to the authors of OpenCLIP for their work.
It's recommended to have a CUDA-12-capable GPU, an NVIDIA driver version 530 or greater, and CUDA 12.1 or later installed. If you don't have this setup, you need to change the pyproject.toml file to use a different version of PyTorch.
With Python 3.10 or later, clone this repo and run:
pip install -e .
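After installing, you can optionally run a quick check (a suggestion of ours, not part of the official setup) to confirm that a CUDA-enabled PyTorch build is in place:
# Optional sanity check: the imports should succeed, and on a properly set-up
# machine the CUDA build version and GPU availability should be reported.
import open_clip  # noqa: F401
import torch

print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())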
You need to have a rarfile backend installed (e.g., unrar).
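As a quick sanity check (a suggestion of ours; tool_setup is available in recent versions of rarfile), you can verify that rarfile finds a backend:
# Raises rarfile.RarCannotExec if no extraction backend (e.g., unrar) is found.
import rarfile

rarfile.tool_setup()
print("rarfile backend found")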
For UCF101, given that the certificate chain served by UCF's website is incomplete, you need to run the following (note that this script runs two sudo commands to add an intermediate Certificate Authority certificate to the system):
./scripts/add_missing_ssl_certs.sh
You need to be logged in to HuggingFace:
huggingface-cli login
You also need to accept the terms of use for the dataset.
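If you prefer to do this from Python rather than through the CLI, the huggingface_hub package provides an equivalent helper:
# Equivalent to `huggingface-cli login`; prompts for an access token
# (or pass it via `token="hf_..."`). Requires the huggingface_hub package.
from huggingface_hub import login

login()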
If you want to run our code with Ray, follow these steps.
First, generate a requirements.txt file. For this, use Poetry (which we use to define the high-level dependencies):
poetry self add poetry-plugin-export
poetry export --format requirements.txt --output requirements.txt --without-hashes
After this, see the example files under scripts/, such as example_ray.sh.
If you want to use the pre-trained model, do:
import torch
from PIL import Image
from cached_path import cached_path
from open_clip import create_model_and_preprocessing
from training.file_utils import pt_load
from training.utils import get_state_dict, patch_model
model, _, transform, tokenizer = create_model_and_preprocessing("ViT-B-32", "openai")
model.eval()
URL = ("https://github.com/Netflix/clove/releases/download/pretrained/"
"clove_without_patching.pt")
patch_model(model, get_state_dict(pt_load(URL), model), weight_for_state_dict=0.6)
image_path = cached_path(
"https://github.com/mlfoundations/open_clip/blob/main/docs/CLIP.png?raw=true")
image = transform(Image.open(image_path)).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])
with torch.inference_mode(), torch.cuda.amp.autocast():
output = model(image, text)
image_features, text_features = output["image_features"], output["text_features"]
print("Label probs:", (100 * image_features @ text_features.T).softmax(dim=-1))
# Prints `[[9.9900e-01, 7.4042e-04, 2.6385e-04]]`.
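Continuing from the snippet above, here is a small compositionality-oriented variation of ours (not taken from the paper): we score the same image against two captions that use nearly the same words in a different order, the kind of hard-negative pair that a bag-of-words text encoder struggles to tell apart.
# Our own illustrative example: both captions contain nearly the same words,
# so separating them requires encoding word order/composition.
hard_negative_pair = tokenizer([
    "a diagram of a text encoder and an image encoder",
    "an image of a text encoder and a diagram encoder",
])
with torch.inference_mode(), torch.cuda.amp.autocast():
    pair_output = model(image, hard_negative_pair)
    pair_scores = pair_output["image_features"] @ pair_output["text_features"].T
print("Pair scores:", pair_scores)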
To evaluate our model on all the benchmarks from the paper, run:
python -m training \
--eval-benchmarks aro color didemo hmdb51 imagenet-v2 imagenet-val msrvtt sts sugar-crepe svo-probes ucf101 val \
winoground youcook2 cb/wds/cars cb/wds/vtab/cifar10 cb/wds/vtab/cifar100 cb/wds/mnist \
cb/wds/vtab/eurosat cb/wds/vtab/flowers cb/wds/vtab/dtd \
--model ViT-B-32 \
--pretrained openai \
--wise-ft https://github.com/Netflix/clove/releases/download/pretrained/clove_without_patching.pt \
--wise-ft-weight-for-2 0.6
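The --wise-ft-weight-for-2 0.6 option applies WiSE-FT-style weight-space patching between the original checkpoint and our fine-tuned one. Conceptually (a simplified sketch of ours, not the repo's exact implementation), it amounts to:
# Simplified sketch (ours, not the repo's code): linearly interpolate each
# parameter between the zero-shot model (weight 1 - w) and the fine-tuned
# model (weight w); the command above uses w = 0.6.
def interpolate_state_dicts(zero_shot_sd, fine_tuned_sd, w=0.6):
    return {name: (1 - w) * zero_shot_sd[name] + w * fine_tuned_sd[name]
            for name in zero_shot_sd}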
You can list all the available program options by running:
python -m training --help
To reproduce our training (fine-tuning) procedure, you need a machine with 8 GPUs with enough memory (e.g., A10, A100, or A40); you may be able to reproduce similar results on other setups by adjusting some parameters. Follow these steps:
- Download LAION-COCO in the webdataset format.
- Set its path in _DATASET_SHORT_NAMES["laion-coco"] in src/training/params.py (see the sketch right after these steps).
- Run:
./scripts/example_multi_gpu.sh
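For the second step above, the entry in src/training/params.py looks roughly like this (the local path and shard pattern are placeholders; point it at wherever you stored the shards):
# Hypothetical sketch of the edit; "/path/to/laion-coco/..." is a placeholder
# for the location of your LAION-COCO webdataset shards.
_DATASET_SHORT_NAMES = {
    # ... other datasets ...
    "laion-coco": "/path/to/laion-coco/{00000..12345}.tar",
}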
If your machine has fewer than 8 GPUs, or scripts/example_multi_gpu.sh otherwise doesn't fit your setup, review the script and adjust it accordingly.
See more training code examples under scripts/.
You can list all the available program options by running:
python -m training --help
Also, see OpenCLIP's repo for more details on how to train models.
@misc{clove,
  title={{CLoVe}: Encoding Compositional Language in Contrastive Vision-Language Models},
  author={Santiago Castro and Amir Ziai and Avneesh Saluja and Zhuoning Yuan and Rada Mihalcea},
  howpublished={arXiv:2402.15021},
  month=feb,
  year={2024},
  url={https://arxiv.org/abs/2402.15021},
  eprint={2402.15021},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
If you use our code, please consider also citing OpenCLIP.
- further test the setup and instructions
- add a figure at the beginning
- change the example code to be more compositionality-specific
- add a repo description and tags
- upload an already-patched model
- maybe create a table with the available pre-trained weights and reference performance
- provide pre-trained weights for larger models
- make it easy to install as a library
- incorporate the weights in open_clip