Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick, ICCV 2023
This paper proposes a foundation model for computer vision. The model is trained on the promptable segmentation task, i.e. it returns a valid segmentation mask for any prompt. The authors use a model-in-the-loop data engine to iteratively improve both the model and the dataset. As a result, the paper introduces SA-1B, the largest segmentation dataset at the time of writing, with over 1B masks.
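To make "promptable segmentation" concrete, here is a minimal usage sketch based on the publicly released segment_anything package; the checkpoint path and example coordinates are placeholders, and exact argument names should be checked against the repository:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor  # released SAM package

# Load a SAM checkpoint (the path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Embed the image once; prompts can then be run cheaply against the cached embedding.
image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for an RGB image (H, W, 3)
predictor.set_image(image)

# A single foreground point prompt; multimask_output=True returns three candidate masks
# plus predicted IoU scores, which is how SAM handles prompt ambiguity.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[512, 512]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]
```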
- The Segment Anything Model (SAM): a foundation model pretrained to return a valid segmentation mask for any prompt. It supports diverse prompts: points, boxes, masks, and free-form text.
- The SA-1B dataset: 11M images and 1.1B masks collected in the fully automatic stage of the data engine. It has 11x more images and 400x more masks than the next largest segmentation dataset, Open Images.
- Task: SAM is pretrained on the promptable segmentation task, as this objective is general enough to enable zero-shot generalization to novel downstream tasks and data distributions through prompt engineering and by composing SAM as part of a larger system.
- Model Architecture: SAM consists of a heavyweight image encoder (an MAE pre-trained ViT), a lightweight prompt encoder, and a lightweight mask decoder; for an ambiguous prompt it predicts three masks, each with an estimated IoU.
- Model Training:
- Loss: Focal loss and dice loss are combined in a 20:1 ratio. The IoU prediction head is trained with an MSE loss that is added to the mask loss with a constant scaling factor of 1.0. We backpropagate only the minimum loss over the three mask predictions (a loss sketch follows this list).
- We initially prompt with points sampled from the ground-truth mask, or with box prompts with a small amount of noise added to their coordinates.
- In subsequent iterations, we iteratively sample points from the error region between the previous prediction and the ground truth, and provide the unthresholded mask logits of the most confident prediction from the previous iteration as an additional prompt to supply maximal information. We do 8 such iterations (see the sampling sketch after this list).
- We then do 2 more iterations with no additional point prompts so the model learns to benefit from the supplied mask.
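The combined training loss is simple enough to sketch. Below is a minimal, illustrative PyTorch version assuming the model emits three mask logit maps and three IoU estimates per prompt; the function and tensor names are my own, not the paper's code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss averaged over pixels (standard formulation)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    """Soft dice loss on the sigmoid probabilities."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(-2, -1))
    union = p.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def sam_loss(mask_logits, iou_preds, gt_mask):
    """mask_logits: (3, H, W) candidate masks, iou_preds: (3,), gt_mask: (H, W) in {0, 1}."""
    gt = gt_mask.float()
    per_mask = torch.stack([
        20.0 * focal_loss(m, gt) + 1.0 * dice_loss(m, gt)  # focal:dice = 20:1
        for m in mask_logits
    ])
    best = per_mask.argmin()  # backpropagate only the lowest-loss mask
    # IoU head is supervised with MSE against the actual IoU of each predicted mask.
    with torch.no_grad():
        pred_bin = (mask_logits > 0).float()
        inter = (pred_bin * gt).sum(dim=(-2, -1))
        union = pred_bin.sum(dim=(-2, -1)) + gt.sum(dim=(-2, -1)) - inter
        true_iou = inter / union.clamp(min=1.0)
    iou_loss = F.mse_loss(iou_preds, true_iou)
    return per_mask[best] + 1.0 * iou_loss  # IoU loss added with scaling factor 1.0
```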
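The interactive simulation above can also be sketched. Everything here is illustrative: `model` is a hypothetical callable returning the most confident mask's logits plus its low-resolution logits, and the point samplers are simplified stand-ins for the paper's procedure:

```python
import numpy as np

def sample_point_from_mask(mask, rng):
    """Pick a random foreground point from a binary mask (illustrative helper)."""
    ys, xs = np.nonzero(mask)
    i = rng.integers(len(ys))
    return (int(xs[i]), int(ys[i]))

def sample_point_from_error_region(pred, gt, rng):
    """Pick a point where prediction and ground truth disagree; label it foreground
    if the model missed a ground-truth pixel, background otherwise."""
    ys, xs = np.nonzero(pred != gt)
    i = rng.integers(len(ys))
    return (int(xs[i]), int(ys[i])), bool(gt[ys[i], xs[i]])

def simulate_interaction(model, image, gt_mask, rng):
    """Simulate the 11 prompt iterations used per training sample:
    1 initial prompt + 8 error-corrective points + 2 mask-only refinement steps."""
    points = [sample_point_from_mask(gt_mask, rng)]  # initial prompt (a noised box is the alternative)
    labels = [1]
    mask_logits, low_res_logits = None, None
    for it in range(11):
        if 1 <= it <= 8:
            point, is_fg = sample_point_from_error_region(mask_logits > 0, gt_mask, rng)
            points.append(point)
            labels.append(1 if is_fg else 0)
        # Iterations 9-10 add no new points; the previous unthresholded logits act as a dense prompt.
        mask_logits, low_res_logits = model(
            image,
            point_coords=np.array(points),
            point_labels=np.array(labels),
            mask_input=low_res_logits,  # most confident prediction from the previous iteration
        )
        # ... compute sam_loss on the outputs and take an optimizer step here ...
    return mask_logits
```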
SAM is extensively evaluated on a suite of 23 datasets with diverse image distributions to verify whether it generalizes beyond its training data.
- SAM performs better than the strongest baseline, RITM, on the single-point valid-mask prediction task: on most datasets when the most confident mask is selected, and on all datasets when the prediction most similar to the ground truth is used. In human studies, SAM consistently receives higher ratings than RITM. Although the ambiguity-unaware version of SAM is rated lower than the ambiguity-aware version, it is still rated higher than RITM.
- Though SAM is not trained for edge detection, when combined with Sobel filtering and NMS it produces reasonable edge maps, which are even more extensive than the ground truth (a simplified edge-map sketch follows this list). It outperforms several methods, though it naturally lags behind state-of-the-art methods that are trained on the target dataset and learn its biases.
- For the object proposal generation task, SAM performs worse than the baseline overall but better on medium, large, common, and rare objects. It only underperforms on small and frequent objects, for which a baseline model trained on the same dataset can easily learn dataset-specific biases.
- SAM is evaluated on instance segmentation on COCO and LVIS by composing it with an object detector that supplies box prompts (a detector-composition sketch follows this list). Though it performs worse than ViTDet-H on automatic metrics, it consistently performs better in a human study; low ground-truth quality and ViTDet learning dataset-specific biases appear to explain this.
- SAM is not robust enough for text prompts yet, but preliminary results are encouraging.
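For the zero-shot edge detection transfer mentioned above, a heavily simplified sketch of the Sobel step is shown below; it assumes `mask_probs` are unthresholded mask probability maps already obtained by prompting SAM with a regular grid of points, and it omits the mask NMS and edge NMS used in the paper:

```python
import numpy as np
from scipy import ndimage

def masks_to_edge_map(mask_probs):
    """Turn per-mask probability maps (N, H, W) into a single edge map via Sobel
    gradient magnitude; a simplified stand-in for the paper's postprocessing."""
    edges = np.zeros(mask_probs.shape[1:], dtype=np.float32)
    for p in mask_probs:
        gx = ndimage.sobel(p, axis=1)
        gy = ndimage.sobel(p, axis=0)
        edges = np.maximum(edges, np.hypot(gx, gy))
    return edges / (edges.max() + 1e-8)  # normalize to [0, 1]
```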
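And for the instance segmentation composition, here is a minimal sketch assuming the segment_anything predictor API and a hypothetical `detector` that returns XYXY boxes, class ids, and scores:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor  # released SAM package

def detect_then_segment(image, detector, predictor):
    """Instance segmentation by prompting SAM with detector boxes.
    `detector` is hypothetical and expected to return (boxes_xyxy, class_ids, scores)."""
    boxes, class_ids, scores = detector(image)
    predictor.set_image(image)
    instances = []
    for box, cls, score in zip(boxes, class_ids, scores):
        masks, mask_scores, _ = predictor.predict(
            box=np.asarray(box),     # one XYXY box as the prompt
            multimask_output=False,  # a box is usually unambiguous, so take a single mask
        )
        instances.append({"mask": masks[0], "class": cls, "score": score})
    return instances
```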
Foundation models pre-trained on large-scale datasets seem like the way forward for high performance on many downstream tasks, and SAM is a key development in this direction. The data-engine concept could also be a solution to the lack of annotated data in other domains.