
OpenESS: Event-Based Semantic Scene Understanding with Open Vocabularies

CVPR 2024 (Highlight)

Lingdong Kong1,2    Youquan Liu3    Lai Xing Ng4    Benoit R. Cottereau5,6    Wei Tsang Ooi1
1National University of Singapore    2CNRS@CREATE    3Hochschule Bremerhaven    4Institute for Infocomm Research, A*STAR    5IPAL, CNRS IRL 2955, Singapore    6CerCo, CNRS UMR 5549, Université Toulouse III

About

OpenESS is an open-vocabulary event-based semantic segmentation (ESS) framework that synergizes information from image, text, and event-data domains to enable scalable ESS in an open-world, annotation-efficient manner.

Teaser: given an input event stream, OpenESS produces zero-shot ESS predictions for open-vocabulary prompts such as “Driveable”, “Car”, “Manmade”, “Walkable”, “Barrier”, and “Flat”.

Table of Contents

  • ⚙️ Installation
  • ♨️ Data Preparation
  • 🚀 Getting Started
  • Benchmark
  • Citation
  • License
  • Acknowledgements

⚙️ Installation

Kindly refer to INSTALL.md for the installation details.

♨️ Data Preparation

DSEC-Semantic

  • Step 1: Download the DSEC dataset from the official dataset page. Below, we summarize the links used for downloading each of the resources:

    Training Data    Link       Size      Description
    Events           download   125 GB    The raw event data in .h5 format
    Frames           download   216 GB    The RGB frames in .png format
    Disparities      download   12 GB     The disparities between left and right sensors
    Semantic Masks   download   88.6 MB   The ground truth semantic segmentation labels

    Test Data        Link       Size      Description
    Events           download   27 GB     The raw event data in .h5 format
    Frames           download   43 GB     The RGB frames in .png format
    Semantic Masks   download   28.9 MB   The ground truth semantic segmentation labels

  • Step 2: Link the dataset to the path ./data. Your dataset folder should match the following structure:

    ./data/DSEC
            ├── test
            │     ├── zurich_city_13_a
            │     │    ├── events
            │     │    │    └── left
            │     │    │          └── events.h5
            │     │    ├── images
            │     │    │    └── left
            │     │    │          │── 000000.png
            │     │    │          │── ...
            │     │    │          └── 000378.png
            │     │    ├── images_aligned
            │     │    │    └── left
            │     │    │          │── 000000.png
            │     │    │          │── ...
            │     │    │          └── 000378.png
            │     │    ├── reconstructions
            │     │    │    └── left
            │     │    │          │── 000000.png
            │     │    │          │── ...
            │     │    │          └── 000378.png
            │     │    └── semantic
            │     │         └── left
            │     │               │── 000000.png
            │     │               │── ...
            │     │               └── 000378.png
            │     ├── zurich_city_14_c
            │     └── zurich_city_15_a
            └── train
                  ├── zurich_city_00_a
                  ├── zurich_city_01_a
                  ├── zurich_city_02_a
                  ├── zurich_city_04_a
                  ├── zurich_city_05_a
                  ├── zurich_city_06_a
                  ├── zurich_city_07_a
                  └── zurich_city_08_a
    
  • Step 3: Prepare frame data that aligns with the events. Please follow the same procedure as in Sun et al. (ESS: Learning Event-based Semantic Segmentation from Still Images), and place the processed frame data into the folder named images_aligned.

    Additionally, we provide our processed DSEC-Semantic frame data at this Google Drive link (~4.95 GB).

  • Step 4: Prepare the zero-shot semantic labels for T2E: Text-to-Event Consistency Regularization. For more details, kindly refer to FC-CLIP.md.

    Additionally, we provide our generated DSEC-Semantic T2E labels at this Google Drive link (~47.5 MB).

  • Step 5: Prepare the event reconstruction data. Please follow the same procedure as in Sun et al. (ESS: Learning Event-based Semantic Segmentation from Still Images), and place the reconstructed frames into the folder named reconstructions.

    The pretrained E2VID model can be downloaded from this link and should be placed under the folder /e2vid/pretrained/.

    Additionally, we provide our processed DSEC-Semantic event reconstruction data at this Google Drive link (~2.41 GB).

  • Step 6: Generate the semantic superpixels of SAM for DSEC-Semantic. You should first download the pretrained SAM model from this link.

    Next, run the following scripts to generate the superpixels:

    # for training set
    python data_preparation/superpixel_generation_dsec_sam.py -r data/DSEC/train
    
    # for test set
    python data_preparation/superpixel_generation_dsec_sam.py -r data/DSEC/test

    The generated superpixels should be placed in the folder named sp_sam_rgb.

  • Step 7: Generate the semantic superpixels of SLIC for DSEC-Semantic. You can directly run the following script to generate the superpixels:

    python data_preparation/superpixel_segmenter_dsec_slic.py --worker $WORKER_NUM --num_segments $SEGMENTS_NUM

    The generated superpixels should be placed in the folder named sp_slic_rgb.
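
Conceptually, Steps 6 and 7 above turn each aligned RGB frame into a per-pixel superpixel ID map, which is later used to group event and image features. The snippet below is a minimal sketch of the SLIC variant using scikit-image; the helper name, paths, segment count, and output format are illustrative assumptions and not the exact behavior of superpixel_segmenter_dsec_slic.py.

    # Minimal sketch of SLIC superpixel generation for one aligned RGB frame.
    # NOT the repository's script; paths, segment count, and the output format
    # (a .png label map per frame) are illustrative assumptions.
    import numpy as np
    from skimage import io
    from skimage.segmentation import slic

    def generate_slic_superpixels(image_path, output_path, num_segments=100):
        """Segment one RGB frame into SLIC superpixels and save the ID map."""
        image = io.imread(image_path)                   # H x W x 3 RGB frame
        labels = slic(image, n_segments=num_segments,   # per-pixel superpixel IDs
                      compactness=10, start_label=0)
        # A 16-bit PNG keeps superpixel IDs larger than 255 intact.
        io.imsave(output_path, labels.astype(np.uint16), check_contrast=False)
        return labels

    # Hypothetical usage:
    # generate_slic_superpixels(
    #     "data/DSEC/train/zurich_city_00_a/images_aligned/left/000000.png",
    #     "data/DSEC/train/zurich_city_00_a/sp_slic_rgb/left/000000.png",
    #     num_segments=100)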

To summarize, for each sequence in DSEC-Semantic, you should prepare the following data before running the experiments (a small, unofficial verification sketch follows the tree below):

./sequence_name
    ├── events
    ├── images
    ├── images_aligned
    ├── pl_fcclip_rgb
    ├── reconstructions
    ├── semantic
    ├── sp_sam_rgb
    └── sp_slic_rgb
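
To catch missing folders early, the sketch below (not part of the official codebase) checks whether every sequence under ./data/DSEC contains the entries listed in the summary tree above; the function name and the check itself are our own additions.

    # Unofficial sanity check for the expected DSEC-Semantic per-sequence layout.
    # It simply mirrors the summary tree above and reports anything missing.
    import os

    REQUIRED_DSEC_ENTRIES = [
        "events", "images", "images_aligned", "pl_fcclip_rgb",
        "reconstructions", "semantic", "sp_sam_rgb", "sp_slic_rgb",
    ]

    def check_dsec_sequences(root="data/DSEC"):
        for split in ("train", "test"):
            split_dir = os.path.join(root, split)
            if not os.path.isdir(split_dir):
                print(f"[!] missing split folder: {split_dir}")
                continue
            for seq in sorted(os.listdir(split_dir)):
                seq_dir = os.path.join(split_dir, seq)
                missing = [e for e in REQUIRED_DSEC_ENTRIES
                           if not os.path.isdir(os.path.join(seq_dir, e))]
                status = "OK" if not missing else "missing: " + ", ".join(missing)
                print(f"{split}/{seq}: {status}")

    if __name__ == "__main__":
        check_dsec_sequences()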

DDD17-Seg

  • Step 1: Download the DDD17 dataset from the official dataset page and/or from the Ev-SegNet paper.

  • Step 2: Link the dataset to the path ./data. Your dataset folder should match the following structure:

    ./data/DDD17
            ├── dir0
            │    ├── events.dat.t
            │    ├── events.dat.xyp
            │    ├── index
            │    │     │── index_10ms.npy
            │    │     │── index_50ms.npy
            │    │     └── index_250ms.npy
            │    ├── images
            │    │     │── img_00000002.png
            │    │     │── ...
            │    │     └── img_00011178.png
            │    ├── images_aligned
            │    │     │── img_00000002.png
            │    │     │── ...
            │    │     └── img_00011178.png
            │    ├── reconstructions
            │    │     │── img_00000002.png
            │    │     │── ...
            │    │     └── img_00011178.png
            │    └── segmentation_masks
            │          │── img_00000002.png
            │          │── ...
            │          └── img_00011178.png
            ├── dir1
            ├── dir3
            ├── dir4
            ├── dir6
            └── dir7
    
  • Step 3: Prepare frame data that aligns with the events. Please follow the same procedure as in Sun et al. (ESS: Learning Event-based Semantic Segmentation from Still Images), and place the processed frame data into the folder named images_aligned. Additionally, we provide our processed DDD17-Seg frame data at this Google Drive link (~1.5 GB).

  • Step 4: Prepare the zero-shot semantic labels for T2E: Text-to-Event Consistency Regularization. For more details, kindly refer to FC-CLIP.md.

    Additionally, we provide our generated DDD17-Seg T2E labels at this Google Drive link (~29.6 MB).

  • Step 5: Prepare the event reconstruction data. Please follow the same procedure as in Sun et al. (ESS: Learning Event-based Semantic Segmentation from Still Images), and place the reconstructed frames into the folder named reconstructions.

    The pretrained E2VID model can be downloaded from this link and should be placed under the folder /e2vid/pretrained/.

    Additionally, we provide our processed DDD17-Seg event reconstruction data at this Google Drive link (~1.14 GB).

  • Step 6: Generate the semantic superpixels of SAM for DDD17-Seg. You should first download the pretrained SAM model from this link.

    Next, run the following scripts to generate the superpixels:

    # if you use a single GPU
    python data_preparation/superpixel_generation_ddd17_sam.py
    
    # if you use multi-GPUs
    python data_preparation/superpixel_generation_ddd17_sam_ddp.py -r data/DDD17 \
    -p pretrained_checkpoints/sam_vit_h_4b8939.pth --skip_exist

    The generated superpixels should be placed in the folder named sp_sam_rgb.

  • Step 7: Generate the semantic superpixels of SLIC for DDD17-Seg. You can directly run the following script to generate the superpixels:

    python data_preparation/superpixel_segmenter_ddd17_slic.py --workers $WORKER_NUM --num_segments $SEGMENTS_NUM

    The generated superpixels should be placed in the folder named sp_slic_rgb.
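
Step 6 above uses the Segment Anything Model to produce class-agnostic masks that act as semantic superpixels. As a rough illustration of what such a generation step involves for a single frame (not the repository's exact implementation; in particular, the way overlapping masks are flattened into a single ID map is an assumption):

    # Rough sketch of SAM-based superpixel generation for one aligned RGB frame.
    # Not the repository's script; the flattening of overlapping masks into a
    # single ID map is an illustrative assumption.
    import numpy as np
    from skimage import io
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    def generate_sam_superpixels(image_path,
                                 checkpoint="pretrained_checkpoints/sam_vit_h_4b8939.pth"):
        sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
        mask_generator = SamAutomaticMaskGenerator(sam)
        image = io.imread(image_path)              # H x W x 3 RGB frame
        masks = mask_generator.generate(image)     # list of dicts with a boolean 'segmentation'
        # Flatten overlapping masks into one superpixel ID map (0 = unassigned).
        superpixels = np.zeros(image.shape[:2], dtype=np.uint16)
        for idx, m in enumerate(sorted(masks, key=lambda m: m["area"], reverse=True), start=1):
            superpixels[m["segmentation"]] = idx
        return superpixels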

To summarize, for each sequence in DDD17-Seg, you should prepare the following data before running the experiments (a small, unofficial verification sketch follows the tree below):

./sequence_name
    ├── events.dat.t
    ├── events.dat.xyp
    ├── images
    ├── images_aligned
    ├── index
    ├── pl_fcclip_rgb
    ├── reconstructions
    ├── segmentation_masks
    ├── sp_sam_rgb
    └── sp_slic_rgb
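
Analogously, a quick (unofficial) sanity check of the DDD17-Seg layout can save a failed run; the snippet below mirrors the summary tree above. What the index_*.npy files encode is not documented here, so they are only loaded to confirm they are readable, assuming they are plain .npy arrays.

    # Unofficial sanity check for the DDD17-Seg per-sequence layout summarized above.
    import os
    import numpy as np

    REQUIRED_DDD17_ENTRIES = [
        "events.dat.t", "events.dat.xyp", "images", "images_aligned",
        "index", "pl_fcclip_rgb", "reconstructions", "segmentation_masks",
        "sp_sam_rgb", "sp_slic_rgb",
    ]

    def check_ddd17_sequences(root="data/DDD17"):
        for seq in sorted(os.listdir(root)):
            seq_dir = os.path.join(root, seq)
            missing = [e for e in REQUIRED_DDD17_ENTRIES
                       if not os.path.exists(os.path.join(seq_dir, e))]
            print(f"{seq}: {'OK' if not missing else 'missing: ' + ', '.join(missing)}")
            # Assumes the index files are plain .npy arrays; we only confirm they load.
            index_file = os.path.join(seq_dir, "index", "index_50ms.npy")
            if os.path.isfile(index_file):
                print(f"  index_50ms.npy shape: {np.load(index_file).shape}")

    if __name__ == "__main__":
        check_ddd17_sequences()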

🚀 Getting Started

Stage 1: Model Pretraining

The first stage of our framework performs F2E: Frame-to-Event Contrastive Distillation and T2E: Text-to-Event Consistency Regularization.
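
Intuitively, F2E pulls event features toward the corresponding frame features at the superpixel level, while T2E aligns event predictions with text-derived zero-shot labels. The block below is a conceptual PyTorch sketch of a superpixel-pooled, InfoNCE-style contrastive loss in the spirit of F2E; the tensor shapes, function names, and pooling scheme are simplifying assumptions, not the code used in this repository.

    # Conceptual sketch of a superpixel-pooled frame-to-event contrastive loss.
    # Shapes, names, and pooling are simplifying assumptions, not the repository's code.
    import torch
    import torch.nn.functional as F

    def superpixel_pool(features, superpixels, num_sp):
        """Average per-pixel features (C, H, W) into per-superpixel embeddings (num_sp, C)."""
        c, h, w = features.shape
        flat_feat = features.reshape(c, h * w).t()                 # (H*W, C)
        flat_sp = superpixels.reshape(h * w)                       # (H*W,) superpixel IDs
        pooled = torch.zeros(num_sp, c).index_add_(0, flat_sp, flat_feat)
        counts = torch.zeros(num_sp).index_add_(0, flat_sp, torch.ones(h * w))
        return pooled / counts.clamp(min=1).unsqueeze(1)

    def f2e_contrastive_loss(event_feat, image_feat, superpixels, num_sp, temperature=0.07):
        """InfoNCE between event and frame embeddings pooled over the same superpixels."""
        ev = F.normalize(superpixel_pool(event_feat, superpixels, num_sp), dim=1)
        im = F.normalize(superpixel_pool(image_feat, superpixels, num_sp), dim=1)
        logits = ev @ im.t() / temperature      # (num_sp, num_sp) similarities
        targets = torch.arange(num_sp)          # the matching superpixel is the positive
        return F.cross_entropy(logits, targets)

    # Toy usage with random tensors (C = 16 features, 8 x 8 pixels, 4 superpixels):
    event_feat = torch.randn(16, 8, 8)
    image_feat = torch.randn(16, 8, 8)
    superpixels = torch.randint(0, 4, (8, 8))
    print(f2e_contrastive_loss(event_feat, image_feat, superpixels, num_sp=4).item())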

To help you with this pretraining process, we have prepared the ready-to-use configuration files (in .yaml format). You can find these files inside the config/pretrain directory.

We include the settings for pretraining on the two datasets, DSEC-Semantic and DDD17-Seg, where the superpixels (constructed with either SAM or SLIC) and event representations have already been specified in the configuration files.

Important: Please make sure that the dataset directory structure strictly follows our required format.

⚙️ Configuration Options

The main settings can be adjusted directly in the configuration files. Below, we describe the configurable options and their valid ranges in detail:

Configuration          Config Key                   Valid Options
Log Directory          dir -> log                   -
Event Representation   clip -> config_option        {'frame2recon', 'frame2voxel', 'recon2voxel'}
Superpixel Source      clip -> superpixel_sources   {'sp_sam_rgb', 'sp_slic_rgb'}
Superpixel Size        clip -> superpixel_size      -
F2E Pretrained Model   clip -> image_weights        {'moco_v1', 'moco_v2', 'swav', 'deepcluster_v2', 'dino'}
T2E Generation Model   clip -> pl_sources           {'pl_fcclip_rgb', 'pl_maskclip_rgb'}
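
These keys live inside the .yaml files under config/pretrain. As a hedged illustration of how such a file could be loaded and the keys from the table above overridden programmatically (the file name below is hypothetical and the exact schema of the real configuration files may differ):

    # Hedged illustration: load a pretraining config and override the keys listed
    # in the table above. The file name is hypothetical; the real schema may differ.
    import yaml

    with open("config/pretrain/dsec_pretrain.yaml") as f:     # hypothetical file name
        cfg = yaml.safe_load(f)

    cfg["dir"]["log"] = "logs/pretrain_dsec_sam"              # Log Directory
    cfg["clip"]["config_option"] = "frame2recon"              # Event Representation
    cfg["clip"]["superpixel_sources"] = "sp_sam_rgb"          # Superpixel Source
    cfg["clip"]["image_weights"] = "moco_v2"                  # F2E Pretrained Model
    cfg["clip"]["pl_sources"] = "pl_fcclip_rgb"               # T2E Generation Model

    with open("config/pretrain/my_experiment.yaml", "w") as f:
        yaml.safe_dump(cfg, f)

    # Then launch pretraining with the command from the next section:
    #   python train.py --settings_file config/pretrain/my_experiment.yaml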

🏃 Ready to Run

To launch the pretraining experiments, simply run the following command with your selected configuration file:

python train.py --settings_file ${SELECTED_CONFIG_PATH}

Currently, the code only supports single-GPU pretraining. If you would like to use multiple GPUs together for pretraining, please adjust the code accordingly.

Stage 2: Linear Probing

In our experimental settings, linear probing serves as a lightweight evaluation of the pretrained model weights; only a small set of linear layers is trainable, while the model backbone remains frozen.

This approach provides a quick validation of the quality of representations learned during the pretraining.
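
As a generic illustration of the idea (not the repository's training code), linear probing freezes the pretrained backbone and optimizes only a small linear head on top of it:

    # Generic linear-probing sketch: freeze the backbone, train only a linear head.
    # This illustrates the idea and is not the repository's training code.
    import torch
    import torch.nn as nn

    backbone = nn.Sequential(                        # stand-in for a pretrained event encoder
        nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU())
    linear_head = nn.Conv2d(64, 11, kernel_size=1)   # 1x1 conv = per-pixel linear classifier
                                                     # (e.g., 11 classes as in DSEC-Semantic)
    for p in backbone.parameters():                  # backbone stays frozen
        p.requires_grad = False

    optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)

    x = torch.randn(2, 3, 64, 64)                    # toy event representation
    y = torch.randint(0, 11, (2, 64, 64))            # toy semantic labels
    with torch.no_grad():
        feats = backbone(x)
    loss = nn.functional.cross_entropy(linear_head(feats), y)
    loss.backward()
    optimizer.step()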

To help you with this linear probing process, we have prepared the ready-to-use configuration files (in .yaml format). You can find these files inside the config/linear_probe directory.

Similar to the pretraining stage, we include settings for both DSEC-Semantic and DDD17-Seg datasets, where the superpixels (constructed with either SAM or SLIC) and event representations have already been specified in the configuration files.

⚙️ Configuration Options

The main settings can be adjusted directly in the configuration files. Below, we describe the configurable options and their valid ranges in detail:

Configuration          Config Key                     Valid Options
Log Directory          dir -> log                     -
Event Representation   clip -> config_option          {'frame2recon', 'frame2voxel', 'recon2voxel'}
Superpixel Source      clip -> superpixel_sources     {'sp_sam_rgb', 'sp_slic_rgb'}
Pretrained Weight      clip -> pre_trained_backbone   -

🏃 Ready to Run

To launch the linear probing experiments, simply run the following command with your selected configuration file:

python train.py --settings_file ${SELECTED_CONFIG_PATH}

Stage 3: Fine-Tuning

In the model fine-tuning stage, we evaluate the performance of our model under different levels of supervision, i.e., varying the proportion of ground truth semantic labels used.

As usual, to help you with this fine-tuning process, we have prepared the ready-to-use configuration files (in .yaml format). You can find these files inside the config/finetunes directory.

Similar to the pretraining and linear probing stages, we include settings for both DSEC-Semantic and DDD17-Seg datasets, where the superpixels (constructed with either SAM or SLIC) and event representations have already been specified in the configuration files.

⚙️ Configuration Options

The main settings can be adjusted directly in the configuration files. Below, we describe the configurable options and their valid ranges in detail:

Configuration          Config Key                     Valid Options                                   Mapping
Log Directory          dir -> log                     -                                               -
Event Representation   clip -> config_option          {'frame2recon', 'frame2voxel', 'recon2voxel'}   -
Superpixel Source      clip -> superpixel_sources     {'sp_sam_rgb', 'sp_slic_rgb'}                   -
Pretrained Weight      clip -> pre_trained_backbone   -                                               -
Label Ratio            clip -> skip_ratio             {'100', '20', '10', '5', '1'}                   {'1': 100%, '5': 20%, '10': 10%, '20': 5%, '100': 1%}
Training Epochs        clip -> num_epochs             -                                               -
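
For intuition, skip_ratio can be read as "keep roughly every N-th labeled sample", which yields the percentages in the mapping above; the snippet below only illustrates this interpretation, and the actual sampling logic in the codebase may differ.

    # Illustrative reading of clip -> skip_ratio: keep every N-th labeled sample.
    # The actual sampling logic in the codebase may differ.
    def subsample_labeled_indices(num_labeled, skip_ratio):
        return list(range(0, num_labeled, skip_ratio))

    for skip_ratio, expected in [(1, "100%"), (5, "20%"), (10, "10%"), (20, "5%"), (100, "1%")]:
        kept = subsample_labeled_indices(1000, skip_ratio)
        print(f"skip_ratio={skip_ratio:>3}: kept {len(kept)} / 1000 samples (~{expected})")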

🏃 Ready to Run

To launch the fine-tuning experiments, simply run the following command with your selected configuration file:

python train.py --settings_file ${SELECTED_CONFIG_PATH}

Here, we provide several trained models and the training logs to help you better understand the experiments, as well as reproduce the results reported in the paper. These materials are open-sourced at this Google Drive link.

Additionally, to help you with result visualizations, we also provide our processed color labels for both DSEC-Semantic and DDD17-Seg, using the color maps from the original papers. These materials are open-sourced at the following Google Drive links: this (for DSEC-Semantic) and this (for DDD17-Seg).
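
If you would rather colorize predictions yourself, the sketch below maps a semantic label PNG to an RGB image through a palette; the palette entries are placeholders, not the official DSEC-Semantic or DDD17-Seg color maps.

    # Sketch for colorizing a semantic label map with a palette.
    # The palette below is a placeholder, NOT the official color maps.
    import numpy as np
    from skimage import io

    PALETTE = np.array([
        [128,  64, 128],   # class 0 (placeholder color)
        [244,  35, 232],   # class 1 (placeholder color)
        [ 70,  70,  70],   # class 2 (placeholder color)
    ], dtype=np.uint8)     # extend to the number of classes in your dataset

    def colorize(label_path, output_path, palette=PALETTE):
        labels = io.imread(label_path)                          # H x W array of class IDs
        color = palette[np.clip(labels, 0, len(palette) - 1)]   # (H, W, 3) RGB image
        io.imsave(output_path, color, check_contrast=False)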

Benchmark

OpenESS Framework

Annotation-Free ESS

Method     Venue        DDD17 Acc   DDD17 mIoU   DSEC Acc   DSEC mIoU
MaskCLIP   ECCV'22      81.29       31.90        58.96      21.97
FC-CLIP    NeurIPS'23   88.66       51.12        79.20      39.42
OpenESS    Ours         90.51       53.93        86.18      43.31

Fully-Supervised ESS

Method        Venue      DDD17 Acc   DDD17 mIoU   DSEC Acc   DSEC mIoU
Ev-SegNet     CVPRW'19   89.76       54.81        88.61      51.76
E2VID         TPAMI'19   85.84       48.47        80.06      44.08
Vid2E         CVPR'20    90.19       56.01        -          -
EVDistill     CVPR'21    -           58.02        -          -
DTL           ICCV'21    -           58.80        -          -
PVT-FPN       ICCV'21    94.28       53.89        -          -
SpikingFCN    NCE'22     -           34.20        -          -
EV-Transfer   RA-L'22    51.90       15.52        63.00      24.37
ESS           ECCV'22    88.43       53.09        84.17      45.38
ESS-Sup       ECCV'22    91.08       61.37        89.37      53.29
P2T-FPN       TPAMI'23   94.57       54.64        -          -
EvSegformer   TIP'23     94.72       54.41        -          -
HMNet-B       CVPR'23    -           -            88.70      51.20
HMNet-L       CVPR'23    -           -            89.80      55.00
HALSIE        WACV'24    92.50       60.66        89.01      52.43

Open-Vocabulary ESS

Method     Venue        DDD17 Acc   DDD17 mIoU   DSEC Acc   DSEC mIoU
MaskCLIP   ECCV'22      90.50       61.27        89.81      55.01
FC-CLIP    NeurIPS'23   90.68       62.01        89.97      55.67
OpenESS    Ours         91.05       63.00        90.21      57.21

Qualitative Assessment

Citation

If you find this work helpful, please kindly consider citing our paper:

@inproceedings{kong2024openess,
  title     = {OpenESS: Event-Based Semantic Scene Understanding with Open Vocabularies},
  author    = {Lingdong Kong and Youquan Liu and Lai Xing Ng and Benoit R. Cottereau and Wei Tsang Ooi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages     = {15686-15698},
  year      = 2024,
}

License

This work is released under the Apache License, Version 2.0, while some specific implementations in this codebase may be under other licenses. If you are using our code for commercial purposes, kindly refer to the original papers and repositories for a more careful check.

Acknowledgements

This work is part of the programme DesCartes and is supported by the National Research Foundation, Prime Minister’s Office, Singapore, under its Campus for Research Excellence and Technological Enterprise (CREATE) programme. ❤️
