# OV2Seg

OV2Seg is the first end-to-end open-vocabulary video instance segmentation model: it can segment, track, and classify objects from novel categories with a Memory-Induced Transformer architecture.

## Setup

### Installation

- Linux or macOS with Python ≥ 3.6
- PyTorch ≥ 1.9 and a torchvision build that matches your PyTorch installation. Install them together from pytorch.org to ensure this. Note: check that your PyTorch version matches the one required by Detectron2.
- Detectron2: follow the Detectron2 installation instructions.
- `pip install -r requirements.txt`
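Before installing the remaining dependencies, a quick pre-flight check can confirm that the active interpreter satisfies the Python ≥ 3.6 requirement above (a minimal sketch, not part of the official setup):

```shell
# Pre-flight check (a sketch): confirm the active Python interpreter
# satisfies the >= 3.6 requirement before installing dependencies.
PY=$(command -v python3 || command -v python)
"$PY" -c "import sys; assert sys.version_info >= (3, 6), sys.version; print('Python OK')"
```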

The following is an example of how to set up a conda environment.

```shell
conda create --name ov2seg python=3.8 -y
conda activate ov2seg
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
pip install -U opencv-python

# under your working directory
git clone git@github.com:facebookresearch/detectron2.git
cd detectron2
pip install -e .
cd ..

# install the panoptic and LVIS APIs
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/lvis-dataset/lvis-api.git

# clone this repo
git clone git@github.com:haochenheheda/LVVIS.git
cd LVVIS
pip install -r requirements.txt
cd ov2seg/modeling/pixel_decoder/ops
sh make.sh
cd ../../../..
```

## Data Preparation

### Dataset structure

```
datasets
|-- LVVIS
|-- coco
|-- lvis
|-- metadata
```

The metadata directory contains pre-computed classifiers for each dataset, generated by DetPro. If you want to generate custom classifiers, please follow that project.
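The layout above can be created up front so that each download has a fixed destination (a sketch; it assumes `datasets` sits in the repository root):

```shell
# Create the expected dataset layout (a sketch; run from the repository root).
mkdir -p datasets/LVVIS datasets/coco datasets/lvis datasets/metadata
```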

### LVIS instance segmentation

Please download the COCO and LVIS datasets following the instructions in detectron2.

### LV-VIS

Download the LV-VIS validation videos and annotations, and organize the files according to the following structure.

```
datasets/LVVIS/
`-- val
    |-- JPEGImages
    |-- val_instances.json
    |-- image_val_instances.json  # for image oracle evaluation
```
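Before running evaluation, it can be worth verifying that the files are where the scripts expect them (a sketch of a pre-flight check; the paths are taken from the layout above):

```shell
# Pre-flight check (a sketch): warn about any missing LV-VIS val files.
for p in datasets/LVVIS/val/JPEGImages \
         datasets/LVVIS/val/val_instances.json \
         datasets/LVVIS/val/image_val_instances.json; do
  [ -e "$p" ] || echo "missing: $p"
done
```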

## Pretrained Model Preparation

Our paper uses ImageNet-21K pretrained models that are not part of Detectron2 (ResNet-50-21K from MIIL and SwinB-21K from Swin-Transformer). Before training, please download the models, place them under `models/`, and follow this tool to convert the format.

```
models
`-- resnet50_miil_21k.pkl
```
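A quick check (a sketch, not part of the repo's scripts) can confirm the converted ResNet-50 checkpoint named above is in place before launching training:

```shell
# Sketch: confirm the converted ImageNet-21K checkpoint is in place.
[ -f models/resnet50_miil_21k.pkl ] \
  && echo "checkpoint found" \
  || echo "please download and convert resnet50_miil_21k.pkl first"
```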

## Train

We provide a script, `scripts/train.sh`, to train the OV2Seg model on the LVIS dataset.

```shell
sh scripts/train.sh
```

## Test

To evaluate a model's performance, use

```shell
sh scripts/eval_video.sh  # evaluate on LV-VIS val set (video)
sh scripts/eval_image.sh  # evaluate on LV-VIS val set (image oracle)
```

You should obtain results similar to the following:

|        | Backbone | LVVIS val | LVVIS test | YouTube-VIS 2019 | YouTube-VIS 2021 | OVIS | weights |
| :----- | :------- | :-------: | :--------: | :--------------: | :--------------: | :--: | :-----: |
| OV2Seg | ResNet50 | 14.2      | 11.4       | 27.2             | 23.6             | 11.2 | link    |
| OV2Seg | Swin-B   | 21.1      | 16.4       | 37.6             | 33.9             | 17.5 |         |

## Acknowledgement

This repo is based on Mask2Former, detectron2, and Detic. Thanks for their great work!