Tianfei Zhou, Wang Xia, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, Daniel Cremers
This repository compiles a collection of resources on image segmentation in the foundation model era, and will be continuously updated to track developments in the field. Please feel free to submit a pull request if you find any work missing.
Image segmentation is a long-standing challenge in computer vision, studied continuously over several decades, as evidenced by seminal algorithms such as N-Cut, FCN, and MaskFormer. With the advent of foundation models (FMs), contemporary segmentation methodologies have entered a new epoch by either adapting FMs (e.g., CLIP, Stable Diffusion, DINO) to image segmentation or developing dedicated segmentation foundation models (e.g., SAM, SAM2). These approaches not only deliver superior segmentation performance, but also herald new segmentation capabilities previously unseen in the deep learning era. However, current research in image segmentation lacks a detailed analysis of the distinct characteristics, challenges, and solutions associated with these advancements. This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation. We investigate two basic lines of research (as shown in the following figure) – generic image segmentation (i.e., semantic segmentation, instance segmentation, panoptic segmentation) and promptable image segmentation (i.e., interactive segmentation, referring segmentation, few-shot segmentation) – by delineating their respective task settings, background concepts, and key challenges. Furthermore, we provide insights into the emergence of segmentation knowledge from FMs such as CLIP, Stable Diffusion, and DINO. An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts. Finally, we discuss open issues and potential avenues for future research.
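To make the promptable setting concrete, below is a minimal sketch of point-prompted segmentation with SAM via Meta's `segment-anything` package. The checkpoint path, input image file, and click coordinates are illustrative placeholders, not values prescribed by the survey.

```python
# Minimal sketch: point-prompted segmentation with SAM (segment-anything package).
# Assumptions: a SAM ViT-B checkpoint has been downloaded locally, and
# "example.jpg" / the click coordinates are placeholders for illustration.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load SAM with a ViT-B image encoder (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# Read an RGB image and compute its embedding once.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single positive click (x, y) serves as the prompt.
point = np.array([[320, 240]])
label = np.array([1])  # 1 = foreground click, 0 = background click

# SAM returns several candidate masks with predicted quality scores.
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # boolean array of shape (H, W)
```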
Given the emergent capabilities of LLMs, a natural question arises: do segmentation properties also emerge from FMs? The answer is yes, even for FMs not explicitly designed for segmentation, such as CLIP, DINO, and diffusion models. This unlocks a new frontier in image segmentation, i.e., acquiring segmentation without any training. The following figure illustrates how to approach this and shows some examples; a minimal code sketch of the DINO case is given after the list below:
- 2.1 Segmentation Emerges from CLIP
- 2.2 Segmentation Emerges from DMs
- 2.3 Segmentation Emerges from DINO
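As a concrete illustration of the DINO case, here is a minimal sketch that treats the self-attention of the [CLS] token in a DINO-pretrained ViT as a coarse foreground mask, without any segmentation-specific training. It assumes the Hugging Face `transformers` library and the public `facebook/dino-vits8` checkpoint; the image path and the 0.6 threshold are arbitrary choices for illustration.

```python
# Minimal sketch: segmentation-like masks emerging from DINO self-attention.
# Assumes the Hugging Face `transformers` library and the public
# `facebook/dino-vits8` checkpoint; no segmentation-specific training is used.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("facebook/dino-vits8")
model = ViTModel.from_pretrained("facebook/dino-vits8", add_pooling_layer=False)
model.eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Attention of the [CLS] token over all patch tokens in the last layer,
# averaged across heads: shape (num_patches,).
attn = outputs.attentions[-1]             # (1, heads, tokens, tokens)
cls_attn = attn[0, :, 0, 1:].mean(dim=0)  # drop the [CLS]-to-[CLS] entry

# Reshape to the patch grid (224 / 8 = 28 for dino-vits8) and threshold.
side = int(cls_attn.numel() ** 0.5)
attn_map = cls_attn.reshape(side, side)
mask = attn_map > 0.6 * attn_map.max()    # arbitrary threshold, for illustration
```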
If you find our survey and repository useful for your research, please consider citing our paper:
```bibtex
@article{zhou2024SegFMSurvey,
  title={Image Segmentation in Foundation Model Era: A Survey},
  author={Zhou, Tianfei and Xia, Wang and Zhang, Fei and Chang, Boyu and Wang, Wenguan and Yuan, Ye and Konukoglu, Ender and Cremers, Daniel},
  journal={arXiv preprint arXiv:2408.12957},
  year={2024}
}
```