We particularly appreciate the annotation efforts of Andrew Delworth and Elise Carman for the attribute-attribute-object dataset. This project is built on top of the ideas of CLIP, compositional zero-shot learning, and language model prompting.
https://github.com/openai/CLIP.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. (2021). Learning Transferable Visual Models From Natural Language Supervision.
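As a point of reference, the kind of CLIP prompting this project builds on can be sketched as below; the model variant, image path, and attribute-object prompts are purely illustrative and are not the vocabulary used in our experiments.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative attribute-object compositions; not the actual dataset vocabulary.
pairs = [("sliced", "apple"), ("ripe", "apple"), ("rusty", "car")]
prompts = clip.tokenize([f"a photo of a {attr} {obj}" for attr, obj in pairs]).to(device)

# "example.jpg" is a placeholder path.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and each composition prompt.
    scores = (image_features @ text_features.T).squeeze(0)

print(pairs[scores.argmax().item()])
```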
We obtained the seen/unseen split information for the datasets with the download_data.sh script supplied by https://github.com/ExplainableML/czsl.
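As a rough sketch of how those splits can be consumed: the directory layout and file names below follow the compositional-split-natural convention of that repository and should be treated as an assumption rather than a guarantee.

```python
from pathlib import Path

def read_pairs(path):
    """Read one whitespace-separated (attribute, object) pair per line."""
    with open(path) as f:
        return {tuple(line.split()) for line in f if line.strip()}

# Assumed layout after running download_data.sh from the czsl repository.
split_dir = Path("data/mit-states/compositional-split-natural")
train_pairs = read_pairs(split_dir / "train_pairs.txt")
test_pairs = read_pairs(split_dir / "test_pairs.txt")

seen = test_pairs & train_pairs    # test compositions also present at training time
unseen = test_pairs - train_pairs  # test compositions held out from training
print(f"{len(seen)} seen / {len(unseen)} unseen test compositions")
```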
For evaluation, we used the following datasets:
http://web.mit.edu/phillipi/Public/states_and_transformations/index.html.
Phillip Isola*, Joseph J. Lim*, and Edward H. Adelson. (2015). Discovering States and Transformations in Image Collections.
https://vision.cs.utexas.edu/projects/finegrained/utzap50k/.
Aron Yu and Kristen Grauman. (2014). Fine-Grained Visual Comparisons with Local Learning.
https://arxiv.org/pdf/2102.01987.pdf.
Muhammad Ferjad Naeem, Yongqin Xian, Federico Tombari, and Zeynep Akata. (2021). Learning Graph Embeddings for Compositional Zero-shot Learning.
The evaluation for compositional zero-shot learning is based on the following codebases:
We obtained the code from https://github.com/ExplainableML/czsl.
We obtained the code from https://github.com/FrankRuis/protoprop.
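For context, the generalized compositional zero-shot evaluation implemented in these codebases sweeps a calibration bias added to the scores of unseen pairs and reports seen accuracy, unseen accuracy, best harmonic mean, and AUC. The snippet below is only a condensed sketch of that protocol with assumed inputs (a score matrix, ground-truth pair indices, and an unseen-pair mask); it is not the code from either repository.

```python
import numpy as np

def czsl_metrics(scores, targets, unseen_mask, num_biases=20):
    """Condensed sketch of the generalized CZSL bias-sweep evaluation.

    scores:      (N, P) compatibility scores for N test images over P candidate pairs
    targets:     (N,) index of the ground-truth pair for each image
    unseen_mask: (P,) boolean, True for pairs never seen during training
    """
    is_unseen_img = unseen_mask[targets]          # images whose true pair is unseen
    span = np.abs(scores).max()
    biases = np.linspace(-span, span, num_biases)

    seen_acc, unseen_acc = [], []
    for b in biases:
        shifted = scores + b * unseen_mask        # bias only the unseen-pair scores
        correct = shifted.argmax(axis=1) == targets
        seen_acc.append(correct[~is_unseen_img].mean())
        unseen_acc.append(correct[is_unseen_img].mean())

    seen_acc, unseen_acc = np.array(seen_acc), np.array(unseen_acc)
    order = np.argsort(seen_acc)
    auc = np.trapz(unseen_acc[order], seen_acc[order])   # area under the seen/unseen curve
    best_hm = (2 * seen_acc * unseen_acc / (seen_acc + unseen_acc + 1e-12)).max()
    return {"auc": auc, "best_hm": best_hm,
            "best_seen": seen_acc.max(), "best_unseen": unseen_acc.max()}
```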