OV2Seg is the first end-to-end Open-Vocabulary video instance segmentation model, which can segment, track, and classify objects from novel categories with a Memory-Induced Transformer architecture.
- Linux or macOS with Python ≥ 3.6
- PyTorch ≥ 1.9 and a torchvision version that matches the PyTorch installation. Install them together at pytorch.org to make sure of this. Note: please check that the PyTorch version matches the one required by Detectron2.
- Detectron2: follow Detectron2 installation instructions.
- pip install -r requirements.txt
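Before installing Detectron2, it can help to confirm that the PyTorch, torchvision, and CUDA versions are consistent; the one-liner below is a convenience check, not part of the original instructions.

```bash
# Print the installed PyTorch / torchvision / CUDA versions so you can
# confirm they match a supported Detectron2 build before proceeding.
python -c "import torch, torchvision; print('torch', torch.__version__, 'torchvision', torchvision.__version__, 'cuda', torch.version.cuda)"
```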
This is an example of how to set up a conda environment.
conda create --name ov2seg python=3.8 -y
conda activate ov2seg
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
pip install -U opencv-python
# under your working directory
git clone git@github.com:facebookresearch/detectron2.git
cd detectron2
pip install -e .
cd ..
# install panopticapi and LVIS API
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/lvis-dataset/lvis-api.git
# clone this repo
git clone git@github.com:haochenheheda/LVVIS.git
cd LVVIS
pip install -r requirements.txt
cd ov2seg/modeling/pixel_decoder/ops
sh make.sh
cd ../../../..
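To confirm that the deformable attention CUDA ops compiled successfully, you can try importing the built extension; the module name below follows the Mask2Former pixel decoder ops setup and is an assumption here.

```bash
# If make.sh succeeded, this import should complete without errors
# (extension name taken from the Mask2Former ops setup; assumed here).
python -c "import MultiScaleDeformableAttention"
```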
Dataset structure
datasets
|-- LVVIS
|-- coco
|-- lvis
|-- metadata
The metadata folder contains pre-computed classifiers for each dataset, which are generated by DetPro. If you want to generate custom classifiers, please follow that project.
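Detectron2 locates datasets through the DETECTRON2_DATASETS environment variable; if your datasets directory is not in the default location, pointing the variable at it should make the layout above discoverable (the path below is a placeholder).

```bash
# Tell Detectron2 where the datasets/ directory above lives
# (replace the path with your own location).
export DETECTRON2_DATASETS=/path/to/datasets
```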
LVIS instance segmentation
Please download the COCO and LVIS datasets following the instructions in detectron2; a rough sketch of the resulting layout is shown below.
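After following those instructions, the coco and lvis folders typically look roughly like this (file names assumed from the standard detectron2 layout; defer to the detectron2 docs if they differ):

datasets
|-- coco
|   |-- annotations
|   |-- train2017
|   `-- val2017
`-- lvis
    |-- lvis_v1_train.json
    `-- lvis_v1_val.json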
LV-VIS
Download the LV-VIS validation videos and annotations, and organize the files according to the following structure.
datasets/LVVIS/
`-- val
|-- JPEGImages
|-- val_instances.json
|-- image_val_instances.json # for image oracle evaluation
Our paper uses ImageNet-21K pretrained models that are not part of Detectron2 (ResNet-50-21K from MIIL and SwinB-21K from Swin-Transformer). Before training, please download the models, place them under models/, and follow this tool to convert them to the expected format.
models
`-- resnet50_miil_21k.pkl
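As a sketch, downloading and converting the ResNet-50-21K weights might look like the commands below; the download URL follows the MIIL ImageNet-21K release and the converter script name is a placeholder for the conversion tool linked above, so verify both before running.

```bash
mkdir -p models
# ResNet-50 ImageNet-21K weights from the MIIL release (verify the URL before use).
wget -P models https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/ImageNet_21K_P/models/resnet50_miil_21k.pth
# Convert the .pth checkpoint into the Detectron2-style .pkl expected above.
# "convert-pretrained-model-to-d2.py" is a placeholder name for the linked conversion tool.
python tools/convert-pretrained-model-to-d2.py models/resnet50_miil_21k.pth models/resnet50_miil_21k.pkl
```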
We provide a script, scripts/train.sh, that trains the OV2Seg model on the LVIS dataset.
sh scripts/train.sh
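For reference, a detectron2-style launch usually looks like the sketch below; the entry script, config path, and batch-size override are illustrative assumptions, so check scripts/train.sh for the actual command used by this repository.

```bash
# Illustrative only: a typical detectron2-style training launch.
# The script name and config path are placeholders; see scripts/train.sh.
python train_net.py --num-gpus 8 \
  --config-file configs/lvis/ov2seg_R50.yaml \
  SOLVER.IMS_PER_BATCH 16
```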
To evaluate a model's performance, use
sh scripts/eval_video.sh # evaluate on LV-VIS val set (video)
sh scripts/eval_image.sh # evaluate on LV-VIS val set (image oracle)
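To evaluate a specific checkpoint (for example, one of the released weights), the standard detectron2 pattern adds --eval-only and a MODEL.WEIGHTS override; the script name, config, and weight path below are placeholders, so defer to scripts/eval_video.sh for the exact invocation.

```bash
# Illustrative only: evaluate a given checkpoint via the standard
# detectron2 --eval-only flow; paths here are placeholders.
python train_net.py --num-gpus 8 --eval-only \
  --config-file configs/lvis/ov2seg_R50.yaml \
  MODEL.WEIGHTS models/ov2seg_r50.pth
```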
You are expected to get results like this:
Name | Backbone | LVVIS val | LVVIS test | Youtube-VIS2019 | Youtube-VIS2021 | OVIS | weights |
---|---|---|---|---|---|---|---|
OV2Seg | ResNet50 | 14.2 | 11.4 | 27.2 | 23.6 | 11.2 | link |
OV2Seg | Swin-B | 21.1 | 16.4 | 37.6 | 33.9 | 17.5 | |
This repo is based on Mask2Former, detectron2, and Detic. Thanks for their great work!