Roger David Soberanis-Mukul, Jiahuan Cheng, Jan Emily Mangulabnan, S. Swaroop Vedula, Masaru Ishii, Gregory D. Hager, Russ Taylor, Mathias Unberath
Machine learning approaches for multi-view geometric scene understanding in endoscopic surgery often assume temporal consistency across the frames to limit challenges that algorithms contend with. However, in monocular scenarios where multiple views are acquired sequentially rather than simultaneously, the static scene assumption is too strong because surgical tools move during the procedure. To enable multi-view models despite tool motion, masking these temporally inconsistent tool regions is a feasible solution. However, manual tool-masking requires a prohibitive effort, given that endoscopic video can contain thousands of frames. This underscores the need for (semi-)automated techniques to 1) automatically mask the tools and/or 2) semi-automatically annotate large datasets such that algorithms for 1) may be developed. To facilitate semi-automated annotation, any solution must be both generalizable, such that it can be used out-of-the-box on diverse datasets, and easy to use. Recent methods for surgical tool annotation require either fine-tuning on domain-specific data or excessive user interaction, limiting their application to new data. Our work introduces GSAM+Cutie, a surgical tool annotation process relying on a combination of two recent foundation models for text-based image segmentation and video object segmentation. We show that a combination of Grounded-SAM and Cutie models provides good generalization for robust text-prompt-based video-level binary segmentation on endoscopic video, streamlining the mask annotation task. Through quantitative evaluation on two datasets, including a proprietary in-house dataset and EndoVis, we show that GSAM+Cutie outperforms similar ensembles, like SAM-PT, for video object segmentation. We also discuss the limitations and future research directions that GSAM+Cutie can motivate.
Tested on Ubuntu 20.04.6. Our method uses python=3.9, torch=1.11.0, and torchvision=0.12.0.
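For example, a matching environment can be created with conda and pip (the environment name gsam_cutie is arbitrary; pick the torch build that matches your CUDA setup):
conda create -n gsam_cutie python=3.9 -y
conda activate gsam_cutie
pip install torch==1.11.0 torchvision==0.12.0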
- Clone the repository:
git clone git@github.com:arcadelab/cutie_plus_gsam.git --recurse-submodules
- Install with pip:
cd segment-anything; pip install -e .; cd ..
cd GroundingDINO; pip install -e .; cd ..
cd Cutie; pip install -e .; cd ..
- Download checkpoints for GroundingDINO:
cd GroundingDINO
mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth
cd ../..
- Download checkpoints for SAM into the checkpoints directory created below (an example download command follows this block):
cd segment-anything
mkdir checkpoints
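For example, assuming the default ViT-H SAM model (which matches seg_model_type set to default in the configuration below), the checkpoint can be fetched with:
cd checkpoints
wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
cd ../..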
- Download checkpoints for Cutie manually with the command below, or they will be downloaded automatically when running run_cutie_scripting_demo.py:
python scripts/download_models.py
Follow these simple steps to get started with the project:
- Begin by configuring the settings in the config.yaml file to suit your requirements. This file contains various parameters, such as checkpoint paths, the text prompt, the dataset, etc. You can find a detailed description of the config.yaml file in the Configuration section below.
- Run the annotation script:
python annotate_binary.py
The annotation will be automatically saved under the data/annotation folder.
- Run the mask propagation script:
python run_cutie_scripting_demo.py
The masks will be automatically saved under the data/masks folder.
- Generate the output video:
python generate_video.py
The results will be saved under the data folder.
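Once config.yaml is configured, the three steps above can also be chained into a single command:
python annotate_binary.py && python run_cutie_scripting_demo.py && python generate_video.py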
The config.yaml file contains the following parameters (an example file is sketched after this list):
- annot_image: Name of the annotation image (prompt for GSAM) for the dataset.
- dataset_name: Name of the dataset.
- class_name: Specifies the class name, set to binary.
- images_dir: Path to the directory containing images/frames for the dataset.
- annotation_dir: Path to the directory containing annotations for the dataset.
- masks_dir: Path to the directory containing masks for the dataset.
- video_dir: Path to the directory containing videos for the dataset.
- video_frames_dir: Path to the directory containing video frames for the dataset.
- cutie_weights: Specifies the weights used for the Cutie model.
- cutie_size: Size parameter for the Cutie model, set to 480.
- seg_model_type: Specifies the type of segmentation model used, set to default.
- checkpoint: Path to the checkpoint file for the SAM model.
- grounding_config_path: Path to the configuration file for GroundingDINO.
- grounding_checkpoint_path: Path to the checkpoint file for GroundingDINO.
- TEXT_PROMPT: Text prompt used for GroundingDINO.
- BOX_TRESHOLD: Threshold value for bounding boxes.
- TEXT_TRESHOLD: Threshold value for text.
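For illustration, a config.yaml could look like the sketch below. All values are placeholders: the file names, paths, and flat key layout are assumptions (the actual file may group keys differently), and the two thresholds simply mirror the common GroundingDINO demo defaults. Adjust everything to your own dataset and checkpoint locations.
annot_image: frame_00000.png
dataset_name: example_endoscopy_sequence
class_name: binary
images_dir: data/images
annotation_dir: data/annotation
masks_dir: data/masks
video_dir: data/video
video_frames_dir: data/video_frames
cutie_weights: Cutie/weights/cutie-base-mega.pth
cutie_size: 480
seg_model_type: default
checkpoint: segment-anything/checkpoints/sam_vit_h_4b8939.pth
grounding_config_path: GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py
grounding_checkpoint_path: GroundingDINO/weights/groundingdino_swint_ogc.pth
TEXT_PROMPT: "surgical tool"
BOX_TRESHOLD: 0.35
TEXT_TRESHOLD: 0.25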
If you find our work useful, please cite us using the following BibTeX entry:
@article{gsam_cutie24,
title={GSAM+Cutie: Text-Promptable Tool Mask Annotation for Endoscopic Video},
author={Roger David Soberanis-Mukul and Jiahuan Cheng and Jan Emily Mangulabnan and S. Swaroop Vedula and Masaru Ishii and Gregory D. Hager and Russ Taylor and Mathias Unberath},
journal={arXiv},
year={2024}
}
Thanks to Grounding DINO, Segment-Anything, and Cutie for their awesome work!
- Grounding DINO: https://github.com/IDEA-Research/GroundingDINO
- Segment-Anything: https://github.com/facebookresearch/segment-anything
- Cutie: https://github.com/hkchengrex/Cutie