jianzongwu/Awesome-Open-Vocabulary


Towards Open Vocabulary Learning: A Survey

T-PAMI, 2024
Jianzong Wu * · Xiangtai Li * · Shilin Xu * · Haobo Yuan * · Henghui Ding · Yibo Yang · Xia Li · Jiangning Zhang · Yunhai Tong · Xudong Jiang · Bernard Ghanem · Dacheng Tao

arXiv PDF TPAMI PDF


This repo records, tracks, and benchmarks recent open vocabulary methods to supplement our survey. If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.

🔥Add Your Paper to Our Repo and Survey!

[-] You are welcome to open an issue or PR for your open vocabulary learning work!

[-] Note: Due to the huge number of papers on arXiv, we are sorry that we cannot cover all of them in our survey. You can submit a PR directly to this repo, and we will record it for the next version of our survey.

[-] Our survey will be updated in 2024.3.

🔥New

[-] Our work has been accepted by T-PAMI! 🔥🔥🔥

[-] We updated the GitHub repo to record the papers available as of 2024/1/10.

[-] We updated the GitHub repo to record the papers available as of 2023/7/20.

🔥Highlight!!

[1] The first survey for open vocabulary learning, including open vocabulary detection/segmentation/tracking.

[2] It also contains several related domains, including foundation model tuning and open-world detection.

[3] We list detailed results for the most representative works and give a fairer and clearer comparison of different approaches.

Introduction

This paper presents the first detailed survey of open vocabulary tasks, including open-vocabulary object detection, open-vocabulary segmentation, and 3D/video open-vocabulary tasks.


Summary of Contents

Methods: A Survey

Keywords

  • cap.: Use caption as auxiliary training data
  • vlm.: Use pretrained VLMs like CLIP
  • pl.: Generate pseudo labels
  • w/o ps.: Training without pixel-level supervision
  • pre.: Vision-language pretraining
  • diff.: Use diffusion models
  • unify: Unify several tasks (semantic segmentation, instance segmentation, and panoptic segmentation)
  • sam: Use SAM (Segment Anything Model)
  • open.: Demonstrated with open-set capability. (only for Video Understanding)
  • audio.: With audio modality.
  • bench: Propose a benchmark.
  • other: Other methods that cannot be grouped into the categories above.
  • no-train: Does not need training.

Open Vocabulary Object Detection

Year Venue Keywords Paper Title Code/Project
2021 CVPR cap. Open-Vocabulary Object Detection Using Captions Code
2022 ICLR vlm. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation Code
2022 CVPR cap., vlm., pre. RegionCLIP: Region-based Language-Image Pretraining Code
2022 CVPR vlm. Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model Code
2022 CVPR vlm., cap. Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation Code
2022 CVPR cap., vlm. Grounded Language-Image Pre-training Code
2022 NeurIPS cap., vlm. GLIPv2: Unifying Localization and VL Understanding Code
2022 GCPR cap. Localized Vision-Language Matching for Open-vocabulary Object Detection Code
2022 ECCV vlm. Open-Vocabulary DETR with Conditional Matching Code
2022 ECCV vlm., cap., pl. Open Vocabulary Object Detection with Pseudo Bounding-Box Labels Code
2022 ECCV vlm. PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images Code
2022 ECCV vlm., pl., w/o ps. Detecting Twenty-thousand Classes using Image-level Supervision Code
2022 ECCV vlm., pl. Exploiting Unlabeled Data with Vision and Language Models for Object Detection Code
2022 ECCV vlm., cap. Simple Open-Vocabulary Object Detection with Vision Transformers Code
2022 NeurIPS vlm., pl. Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection Code
2022 NeurIPS vlm., cap. DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection N/A
2022 arXiv vlm. Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization Code
2022 arXiv vlm., pl. P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection N/A
2023 ICLR vlm., pl. Learning Object-Language Alignments for Open-Vocabulary Object Detection Code
2023 ICLR vlm. F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models Code
2023 CVPR other, vlm. Learning to Detect and Segment for Open Vocabulary Object Detection N/A
2023 CVPR vlm., cap. Aligning Bag of Regions for Open-Vocabulary Object Detection Code
2023 CVPR vlm. Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection Code
2023 CVPR vlm. CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching N/A
2023 CVPR vlm., pl. DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment N/A
2023 CVPR vlm. Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers N/A
2023 ICML vlm. Multi-Modal Classifiers for Open-Vocabulary Object Detection Project
2023 arXiv vlm. GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning N/A
2023 arXiv vlm., cap. Enhancing the Role of Context in Region-Word Alignment for Object Detection N/A
2023 arXiv cap., pl. Open-Vocabulary Object Detection using Pseudo Caption Labels N/A
2023 arXiv vlm., pl. Three ways to improve feature alignment for open vocabulary detection N/A
2023 arXiv vlm. Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection N/A
2023 TMLR vlm., cap., pl. MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks N/A
2023 NeurIPS vlm., cap., pl. Scaling Open-Vocabulary Object Detection N/A
2023 arXiv vlm. Open-Vocabulary Object Detection via Scene Graph Discovery N/A
2023 ICCV vlm. Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection Code
2023 ICCV vlm. EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment Code
2023 KDD vlm. What Makes Good Open-Vocabulary Detector: A Disassembling Perspective N/A
2023 NeurIPS vlm. CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection Code
2023 arXiv vlm. DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection Code
2023 arXiv vlm. Taming Self-Training for Open-Vocabulary Object Detection Code
2023 arXiv unify., vlm., pre. CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction Code
2023 BMVC vlm. Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization N/A
2024 AAAI vlm. Simple Image-level Classification Improves Open-vocabulary Object Detection Code
2024 AAAI vlm. ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection Code
2024 AAAI unify., vlm., pre. CLIM: Contrastive Language-Image Mosaic for Region Representation Code
2024 WACV vlm. LP-OVOD: Open-Vocabulary Object Detection by Linear Probing Code
2024 CVPR vlm. YOLO-World: Real-Time Open-Vocabulary Object Detection Code
2024 CVPR bench The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding Project
2024 ICLR vlm. LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors N/A
2024 arXiv vlm. Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection Code

Open Vocabulary Segmentation

Year Venue Keywords Paper Title Code/Project
2023 CVPR unify., vlm. Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation Code
2023 CVPR unify., vlm. FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation Code
2023 arXiv unify., vlm. OpenSD: Unified Open-Vocabulary Segmentation and Detection Code

Semantic Segmentation

Year Venue Keywords Paper Title Code/Project
2022 ICLR vlm. Language-driven Semantic Segmentation Code
2022 CVPR cap., w/o ps. GroupViT: Semantic Segmentation Emerges from Text Supervision Code
2022 CVPR vlm. ZegFormer: Decoupling Zero-Shot Semantic Segmentation Code
2022 ECCV cap., vlm. Scaling Open-Vocabulary Image Segmentation with Image-Level Labels N/A
2022 ECCV vlm., pl., w/o ps. Extract Free Dense Labels from CLIP Code
2022 ECCV vlm. A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-Language Model Code
2022 ECCV vlm., cap., w/o ps. Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding N/A
2022 BMVC vlm. Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models Code
2022 arXiv vlm., cap., pl., w/o ps. Perceptual Grouping in Contrastive Vision-Language Models Code
2022 arXiv vlm., cap., pl., w/o ps. SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation Code
2022 arXiv vlm., cap., w/o ps. Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning N/A
2023 CVPR vlm., pre. Generalized Decoding for Pixel, Image, and Language Code
2023 CVPR vlm., pl. Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP Code
2023 CVPR cap., vlm., w/o ps. Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision Code
2023 CVPR vlm. Side Adapter Network for Open-Vocabulary Semantic Segmentation Code
2023 arXiv vlm., unify A Simple Framework for Open-Vocabulary Segmentation and Detection Code
2023 arXiv vlm. Global Knowledge Calibration for Fast Open-Vocabulary Segmentation N/A
2023 arXiv vlm. CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation Code
2023 arXiv vlm., unify Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition Code
2023 arXiv vlm., unify Segment Everything Everywhere All at Once Code
2023 arXiv vlm. MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation N/A
2023 arXiv vlm. TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation N/A
2023 arXiv vlm., w/o ps., sam Exploring Open-Vocabulary Semantic Segmentation without Human Labels N/A
2023 arXiv vlm., unify DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model N/A
2023 arXiv diff. Diffusion Models for Zero-Shot Open-Vocabulary Segmentation Project
2023 ICCV diff. DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models Project
2023 ICCV diff. Guiding Text-to-Image Diffusion Model Towards Grounded Generation Project
2023 NeurIPS cap., w/o ps. Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation Code
2023 arXiv vlm. SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation Code
2023 arXiv vlm., no-train Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models N/A
2023 arXiv vlm., no-train Grounding Everything: Emerging Localization Properties in Vision-Language Transformers Code
2023 arXiv vlm. Open-Vocabulary Segmentation with Semantic-Assisted Calibration N/A
2023 arXiv vlm., no-train Self-Guided Open-Vocabulary Semantic Segmentation N/A
2023 arXiv no-train, vlm., sam CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor Project
2023 arXiv vlm. CLIP-DINOiser: Teaching CLIP a few DINO tricks Code
2024 arXiv vlm., no-train Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation Code
2024 ECCV vlm., no-train In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation Code

Instance Segmentation

Year Venue Keywords Paper Title Code/Project
2023 CVPR vlm. Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation Code
2022 CVPR cap., pl., vlm. Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling Code
2023 CVPR vlm., cap., w/o ps. Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations Code
2023 arXiv cap. Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation Code
2023 arXiv cap. Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation N/A

Panoptic Segmentation

Year Venue Keywords Paper Title Code/Project
2023 CVPR unify., vlm. Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation Code
2022 arXiv vlm. Open-Vocabulary Panoptic Segmentation with MaskCLIP N/A
2023 CVPR diff., vlm. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models Code
2023 ICCV vlm. Open-vocabulary Panoptic Segmentation with Embedding Modulation N/A
2023 NeurIPS vlm., unify Hierarchical Open-vocabulary Universal Image Segmentation Code
2024 CVPR vlm., unify, 'open' OMG-Seg: Is One Model Good Enough For All Segmentation? Code

Open Vocabulary Video Understanding

Video Classification

Year Venue Keywords Paper Title Code/Project
2021 arXiv vlm., open. ActionCLIP: A New Paradigm for Video Action Recognition Code
2022 ECCV vlm., open. Prompting Visual-Language Models for Efficient Video Understanding Project
2022 ECCV vlm. Frozen CLIP Models are Efficient Video Learners Code
2022 ECCV vlm., open. Expanding Language-Image Pretrained Models for General Video Recognition Code
2022 arXiv vlm., open., audio. Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models N/A
2023 AAAI vlm., open. Revisiting Classifier: Transferring Vision-Language Models for Video Recognition Code
2023 ICLR vlm. AIM: Adapting Image Models for Efficient Video Action Recognition Project
2023 CVPR vlm., open. Fine-tuned CLIP Models are Efficient Video Learners Code
2023 ICML vlm., open. Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization Code
2023 ICCV vlm., open. Video Action Recognition with Attentive Semantic Units N/A
2023 ICCV vlm., open. MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge Code
2023 arXiv vlm., open. VicTR: Video-conditioned Text Representations for Activity Recognition N/A
2023 arXiv vlm., open. Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition N/A
2024 NeurIPS vlm., open. AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation Code

Tracking

Year Venue Keywords Paper Title Code/Project
2023 CVPR vlm., open. OVTrack: Open-Vocabulary Multiple Object Tracking Project

Video Instance Segmentation

Year Venue Keywords Paper Title Code/Project
2023 ICCV vlm., open. Towards Open-Vocabulary Video Instance Segmentation Code
2023 arXiv vlm., open. OpenVIS: Open-vocabulary Video Instance Segmentation N/A
2023 arXiv vlm., open. DVIS++: Improved Decoupled Framework for Universal Video Segmentation Code

Open Vocabulary 3D Scene Understanding

3D Classification

Year Venue Keywords Paper Title Code/Project
2022 CVPR vlm. PointCLIP: Point Cloud Understanding by CLIP Code
2023 CVPR vlm. ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding Code
2023 ICCV vlm. PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning Code
2023 ICCV vlm. CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training Code
2023 ICML vlm. Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining Code
2024 WACV vlm. LidarCLIP or: How I Learned to Talk to Point Clouds Code

3D Detection

Year Venue Keywords Paper Title Code/Project
2022 arXiv vlm. Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning N/A
2023 CVPR vlm. Open-Vocabulary Point-Cloud Object Detection without 3D Annotation Code
2023 NeurIPS vlm. CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection Project
2023 arXiv vlm. Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection N/A
2023 arXiv vlm. FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection N/A
2023 arXiv vlm. OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection N/A

3D Segmentation

Year Venue Keywords Paper Title Code/Project
2023 CVPR vlm. PLA: Language-Driven Open-Vocabulary 3D Scene Understanding Code
2023 CVPR vlm. CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP Code
2023 CVPR vlm. OpenScene: 3D Scene Understanding with Open Vocabularies Project
2023 ICCVW vlm. CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP N/A
2023 NeurIPS vlm. OpenMask3D: Open-Vocabulary 3D Instance Segmentation Project
2023 arXiv vlm. OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation Project
2023 arXiv vlm. Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance Project
2024 arXiv vlm. UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation Code
2024 arXiv vlm. OpenSU3D: Open World 3D Scene Understanding using Foundation Models Project

Related Domains and Beyond

Class-agnostic Detection and Segmentation

Year Venue Keywords Paper Title Code/Project
2022 RA-L - Learning Open-World Object Proposals without Learning to Classify Code
2021 ICCV - Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation Project
2022 CVPR - Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity Project
2022 ECCV - Class-agnostic object detection with multi-modal transformer Code
2022 TPAMI - Open World Entity Segmentation Project
2023 ICCV - Fine-Grained Entity Segmentation Project
2023 ICCV bench SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning Code

Open-World Object Detection

Year Venue Keywords Paper Title Code/Project
2015 CVPR - Towards Open World Recognition N/A
2021 CVPR - Towards Open World Object Detection Code
2022 CVPR - OW-DETR: Open-world Detection Transformer Code
2022 ECCV - UC-OWOD: Unknown-Classified Open World Object Detection Code
2022 arXiv - Revisiting Open World Object Detection Code
2022 arXiv - Rectifying Open-set Object Detection: A Taxonomy, Practical Applications, and Proper Evaluation N/A
2022 arXiv - Open World DETR: Transformer based Open World Object Detection N/A
2023 CVPR - PROB: Probabilistic Objectness for Open World Object Detection Code
2023 arXiv - Open World Object Detection in the Era of Foundation Models Code
2023 arXiv - Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection N/A

Open-Set Panoptic Segmentation

Year Venue Keywords Paper Title Code/Project
2021 CVPR - Exemplar-Based Open-Set Panoptic Segmentation Network Project
2022 BMVC - Dual Decision Improves Open-Set Panoptic Segmentation Code

Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

@article{wu2023open,
      title={Towards Open Vocabulary Learning: A Survey},
      author={Jianzong Wu and Xiangtai Li and Shilin Xu and Haobo Yuan and Henghui Ding and Yibo Yang and Xia Li and Jiangning Zhang and Yunhai Tong and Xudong Jiang and Bernard Ghanem and Dacheng Tao},
      year={2024},
      journal={T-PAMI},
}

Contact

jzwu@stu.pku.edu.cn
lxtpku@pku.edu.cn or xiangtai94@gmail.com
