We provide links to download our preprocessed dataset. If you would like to process the data on your own, we will soon provide scripts for you to do so.
The pretraining datasets used in OFA are all publicly available. The public links are listed below. We recommend downloading the data from these links first and then processing it into a format similar to the examples we provide.
- CC12M: https://github.com/google-research-datasets/conceptual-12m
- CC3M: https://github.com/google-research-datasets/conceptual-captions
- SBU: https://www.cs.virginia.edu/~vicente/sbucaptions
- COCO: https://cocodataset.org/#home
- VG: https://visualgenome.org/
- VQAv2: https://visualqa.org/
- GQA: https://cs.stanford.edu/people/dorarad/gqa/about.html
- RefCOCO/RefCOCO+/RefCOCOg: https://github.com/lichengunc/refer
- OpenImages: https://storage.googleapis.com/openimages/web/index.html
- Objects365: https://www.objects365.org/overview.html
- YFCC100M (subset): https://github.com/openai/CLIP/blob/main/data/yfcc100m.md
- ImageNet-21K: https://image-net.org/index.php
- Pile: https://pile.eleuther.ai
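
As a rough illustration of processing downloaded image-caption data into a single-file format, the sketch below writes one tab-separated row per sample with the image stored as a base64 string. The column order (id, caption, base64 image) and the helper names `to_tsv_row` and `write_tsv` are assumptions made for this example, not the confirmed layout; check the provided example files and match their columns before using the output.

```python
import base64
from pathlib import Path


def to_tsv_row(uniq_id, image_bytes, caption):
    """Build one tab-separated row: id, caption, base64-encoded image.

    The column order is an assumption for illustration only.
    """
    # Collapse tabs and newlines so the caption cannot break the row layout.
    clean_caption = " ".join(caption.split())
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return "\t".join([str(uniq_id), clean_caption, image_b64])


def write_tsv(pairs, out_path):
    """Write (image_path, caption) pairs to out_path, one row per pair."""
    with open(out_path, "w", encoding="utf-8") as f:
        for uniq_id, (image_path, caption) in enumerate(pairs):
            image_bytes = Path(image_path).read_bytes()
            f.write(to_tsv_row(uniq_id, image_bytes, caption) + "\n")
```

A call such as `write_tsv([("images/0001.jpg", "a dog on a beach")], "caption_train.tsv")` would produce one row per pair; both paths here are hypothetical.
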
- Dataset for Caption
- Dataset for RefCOCO
- Dataset for RefCOCO+
- Dataset for RefCOCOg
- Dataset for VQAv2 (chunked parts of the dataset files are also available for more convenient downloading; please refer to issue #68)
- Dataset for SNLI-VE
- Dataset for Text-to-Image Generation
- Dataset for Text-to-Image Generation (with original id)
- Dataset for COLA
- Dataset for MNLI
- Dataset for MRPC
- Dataset for QNLI
- Dataset for QQP
- Dataset for RTE
- Dataset for SST2
- Dataset for Gigaword
Here we provide raw image files for visualization examples in OFA.