Lingdong Kong1,2
Youquan Liu3
Lai Xing Ng4
Benoit R. Cottereau5,6
Wei Tsang Ooi1
1National University of Singapore
2CNRS@CREATE
3Hochschule Bremerhaven
4Institute for Infocomm Research, A*STAR
5IPAL, CNRS IRL 2955, Singapore
6CerCo, CNRS UMR 5549, Université Toulouse III
OpenESS is an open-vocabulary event-based semantic segmentation (ESS) framework that synergizes information from the image, text, and event-data domains to enable scalable ESS in an open-world, annotation-efficient manner.
Figure: Given an input event stream, OpenESS produces zero-shot ESS predictions for open-vocabulary queries such as “Driveable”, “Car”, “Manmade”, “Walkable”, “Barrier”, and “Flat”.
Kindly refer to INSTALL.md for the installation details.
Step 1: Download the DSEC dataset from the official dataset page. Below, we summarize the links used for downloading each of the resources:
Training Data | Link | Size | Description |
---|---|---|---|
Events | download | 125 GB | The raw event data in .h5 format |
Frames | download | 216 GB | The RGB frames in .png format |
Disparities | download | 12 GB | The disparities between left and right sensors |
Semantic Masks | download | 88.6 MB | The ground truth semantic segmentation labels |

Test Data | Link | Size | Description |
---|---|---|---|
Events | download | 27 GB | The raw event data in .h5 format |
Frames | download | 43 GB | The RGB frames in .png format |
Semantic Masks | download | 28.9 MB | The ground truth semantic segmentation labels |
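After downloading, you can quickly sanity-check one of the event recordings with a few lines of Python. This is only a convenience sketch: it assumes the h5py package is installed, and the group/dataset names printed by your own files are the authoritative reference (the events group with x, y, p, t datasets below is our assumption about the DSEC layout).

```python
# Sanity-check a downloaded DSEC event recording (sketch only; the internal
# group/dataset names are assumptions -- verify with the printed hierarchy).
import h5py

path = "data/DSEC/train/zurich_city_00_a/events/left/events.h5"
with h5py.File(path, "r") as f:
    # Print every group/dataset name stored in the HDF5 file.
    f.visit(print)
    # If the recording follows the usual DSEC layout, report per-event array shapes.
    if "events" in f:
        print({key: f["events"][key].shape for key in f["events"].keys()})
```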
Step 2: Link the dataset to the path ./data. Your dataset folder should end up aligning with the following dataset structure:

./data/DSEC
├── test
│   ├── zurich_city_13_a
│   │   ├── events
│   │   │   └── left
│   │   │       └── events.h5
│   │   ├── images
│   │   │   └── left
│   │   │       ├── 000000.png
│   │   │       ├── ...
│   │   │       └── 000378.png
│   │   ├── images_aligned
│   │   │   └── left
│   │   │       ├── 000000.png
│   │   │       ├── ...
│   │   │       └── 000378.png
│   │   ├── reconstructions
│   │   │   └── left
│   │   │       ├── 000000.png
│   │   │       ├── ...
│   │   │       └── 000378.png
│   │   └── semantic
│   │       └── left
│   │           ├── 000000.png
│   │           ├── ...
│   │           └── 000378.png
│   ├── zurich_city_14_c
│   └── zurich_city_15_a
└── train
    ├── zurich_city_00_a
    ├── zurich_city_01_a
    ├── zurich_city_02_a
    ├── zurich_city_04_a
    ├── zurich_city_05_a
    ├── zurich_city_06_a
    ├── zurich_city_07_a
    └── zurich_city_08_a
Step 3: Prepare frame data that aligns with the events. Please follow the same procedure as Sun et al. (ESS: Learning Event-based Semantic Segmentation from Still Images) and place the processed frame data into the folder named images_aligned. Additionally, we provide our processed DSEC-Semantic frame data at this Google Drive link (~4.95 GB).
Step 4: Prepare the zero-shot semantic labels for T2E: Text-to-Event Consistency Regularization. For more details, kindly refer to FC-CLIP.md.
Additionally, we provide our generated DSEC-Semantic T2E labels at this Google Drive link (~47.5 MB).
Step 5: Prepare the event reconstruction data. Please follow the same procedure as Sun et al. (ESS: Learning Event-based Semantic Segmentation from Still Images) and place the processed reconstruction data into the folder named reconstructions. The pretrained E2VID model can be downloaded from this link and should be placed under the folder /e2vid/pretrained/. Additionally, we provide our processed DSEC-Semantic event reconstruction data at this Google Drive link (~2.41 GB).
Step 6: Generate the semantic superpixels of SAM for DSEC-Semantic. You should first download the pretrained SAM model from this link.
Next, run the following scripts to generate the superpixels:
```bash
# for training set
python data_preparation/superpixel_generation_dsec_sam.py -r data/DSEC/train

# for test set
python data_preparation/superpixel_generation_dsec_sam.py -r data/DSEC/test
```
The generated superpixels should be placed in the folder named sp_sam_rgb.
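For reference, the core idea of the script above can be sketched with the official segment_anything package: SAM's automatic masks are rasterized into an integer superpixel map per aligned frame. The snippet below is a simplified illustration rather than the provided script; the example image path, the mask-to-ID conversion, and the .npy output name are assumptions, so please rely on superpixel_generation_dsec_sam.py for the actual format.

```python
# Simplified sketch of SAM-based superpixel generation (not the provided script).
import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="pretrained_checkpoints/sam_vit_h_4b8939.pth")
# sam.to("cuda")  # optional: move the model to GPU for faster mask generation
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.array(Image.open(
    "data/DSEC/train/zurich_city_00_a/images_aligned/left/000000.png").convert("RGB"))
masks = sorted(mask_generator.generate(image), key=lambda m: m["area"], reverse=True)

# Rasterize the masks into one integer ID per pixel; smaller masks overwrite larger ones.
superpixels = np.zeros(image.shape[:2], dtype=np.int32)
for idx, m in enumerate(masks, start=1):
    superpixels[m["segmentation"]] = idx

np.save("sp_sam_rgb_000000.npy", superpixels)  # hypothetical output name
```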
Step 7: Generate the semantic superpixels of SLIC for DSEC-Semantic. You can directly run the following script to generate the superpixels:
python data_preparation/superpixel_segmenter_dsec_slic.py --worker $WORKER_NUM --num_segments $SEGMENTS_NUM
The generated superpixels should be placed in the folder named sp_slic_rgb.
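Similarly, the SLIC variant can be pictured with scikit-image's slic function, where the number of segments mirrors the --num_segments argument above. Again, this is a hedged sketch rather than the provided script; the example path and output format are assumptions.

```python
# Simplified sketch of SLIC superpixel generation (not the provided script).
import numpy as np
from PIL import Image
from skimage.segmentation import slic

image = np.array(Image.open(
    "data/DSEC/train/zurich_city_00_a/images_aligned/left/000000.png").convert("RGB"))

# Assign each pixel an integer superpixel label in [0, n_segments).
labels = slic(image, n_segments=100, compactness=10, start_label=0)
np.save("sp_slic_rgb_000000.npy", labels.astype(np.int32))  # hypothetical output name
```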
To summarize, for each of the sequences in DSEC-Semantic, we expect that you prepare the following data for running the experiments:
./sequence_name
├── events
├── images
├── images_aligned
├── pl_fcclip_rgb
├── reconstructions
├── semantic
├── sp_sam_rgb
└── sp_slic_rgb
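If you would like to verify this layout programmatically before launching the experiments, a small check such as the one below can help. It is a convenience sketch only; the folder names are taken from the tree above.

```python
# Convenience sketch: verify that every prepared DSEC-Semantic sequence
# contains the folders listed above (names taken from the tree).
from pathlib import Path

REQUIRED = ["events", "images", "images_aligned", "pl_fcclip_rgb",
            "reconstructions", "semantic", "sp_sam_rgb", "sp_slic_rgb"]

for split in ("train", "test"):
    for seq in sorted(Path("data/DSEC", split).iterdir()):
        if not seq.is_dir():
            continue
        missing = [name for name in REQUIRED if not (seq / name).exists()]
        print(f"{split}/{seq.name}: {'OK' if not missing else 'missing ' + ', '.join(missing)}")
```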
Step 1: Download the DDD17 dataset from the official dataset page and/or from the Ev-SegNet paper.
Step 2: Link the dataset to the path ./data. Your dataset folder should end up aligning with the following dataset structure:

./data/DDD17
├── dir0
│   ├── events.dat.t
│   ├── events.dat.xyp
│   ├── index
│   │   ├── index_10ms.npy
│   │   ├── index_50ms.npy
│   │   └── index_250ms.npy
│   ├── images
│   │   ├── img_00000002.png
│   │   ├── ...
│   │   └── img_00011178.png
│   ├── images_aligned
│   │   ├── img_00000002.png
│   │   ├── ...
│   │   └── img_00011178.png
│   ├── reconstructions
│   │   ├── img_00000002.png
│   │   ├── ...
│   │   └── img_00011178.png
│   └── segmentation_masks
│       ├── img_00000002.png
│       ├── ...
│       └── img_00011178.png
├── dir1
├── dir3
├── dir4
├── dir6
└── dir7
Step 3: Prepare frame data that aligns with the events. Please follow the same procedure as Sun et al. (ESS: Learning Event-based Semantic Segmentation from Still Images) and place the processed frame data into the folder named images_aligned. Additionally, we provide our processed DDD17-Seg frame data at this Google Drive link (~1.5 GB).
Step 4: Prepare the zero-shot semantic labels for T2E: Text-to-Event Consistency Regularization. For more details, kindly refer to FC-CLIP.md.
Additionally, we provide our generated DDD17-Seg T2E labels at this Google Drive link (~29.6 MB).
Step 5: Prepare the event reconstruction data. Please follow the same procedure as Sun et al. (ESS: Learning Event-based Semantic Segmentation from Still Images) and place the processed reconstruction data into the folder named reconstructions. The pretrained E2VID model can be downloaded from this link and should be placed under the folder /e2vid/pretrained/. Additionally, we provide our processed DDD17-Seg event reconstruction data at this Google Drive link (~1.14 GB).
Step 6: Generate the semantic superpixels of SAM for DDD17-Seg. You should first download the pretrained SAM model from this link.
Next, run the following scripts to generate the superpixels:
```bash
# if you use a single GPU
python data_preparation/superpixel_generation_ddd17_sam.py

# if you use multiple GPUs
python data_preparation/superpixel_generation_ddd17_sam_ddp.py -r data/DDD17 \
    -p pretrained_checkpoints/sam_vit_h_4b8939.pth --skip_exist
```
The generated superpixels should be placed in the folder named sp_sam_rgb.
Step 7: Generate the semantic superpixels of SLIC for DDD17-Seg. You can directly run the following script to generate the superpixels:
python data_preparation/superpixel_segmenter_ddd17_slic.py --workers $WORKER_NUM --num_segments $SEGMENTS_NUM
The generated superpixels should be placed in the folder named sp_slic_rgb.
To summarize, for each of the sequences in DDD17-Seg, we expect that you prepare the following data for running the experiments:
./sequence_name
├── events.dat.t
├── events.dat.xyp
├── images
├── images_aligned
├── index
├── pl_fcclip_rgb
├── reconstructions
├── segmentation_masks
├── sp_sam_rgb
└── sp_slic_rgb
The first stage of our framework performs F2E (Frame-to-Event Contrastive Distillation) and T2E (Text-to-Event Consistency Regularization).
To help you with this pretraining process, we have prepared ready-to-use configuration files (in .yaml format). You can find these files inside the config/pretrain directory.
We include the settings for pretraining on the two datasets, DSEC-Semantic and DDD17-Seg, where the superpixels (constructed with either SAM or SLIC) and event representations have already been specified in the configuration files.
Important: Please make sure that the dataset directory structure strictly follows our required format.
The main settings can be adjusted directly in the configuration files. Below, we describe the configurable options and their valid values in detail:
Configuration | Config Key | Valid Options |
---|---|---|
Log Directory | dir -> log | - |
Event Representation | clip -> config_option | {'frame2recon', 'frame2voxel', 'recon2voxel'} |
Superpixel Source | clip -> superpixel_sources | {'sp_sam_rgb', 'sp_slic_rgb'} |
Superpixel Size | clip -> superpixel_size | - |
F2E Pretrained Model | clip -> image_weights | {'moco_v1', 'moco_v2', 'swav', 'deepcluster_v2', 'dino'} |
T2E Generation Model | clip -> pl_sources | {'pl_fcclip_rgb', 'pl_maskclip_rgb'} |
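The key column uses the notation section -> key, i.e., a nested entry in the YAML file. The snippet below shows how these entries can be inspected with PyYAML; the file name config/pretrain/example_config.yaml is a placeholder, and the exact nesting is our assumption based on this notation.

```python
# Hedged sketch: read a pretraining config and print the options from the table.
# The config file name is a placeholder; "dir -> log" means cfg["dir"]["log"], etc.
import yaml

with open("config/pretrain/example_config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

print("log directory:       ", cfg["dir"]["log"])
print("event representation:", cfg["clip"]["config_option"])
print("superpixel source:   ", cfg["clip"]["superpixel_sources"])
print("superpixel size:     ", cfg["clip"]["superpixel_size"])
print("F2E image weights:   ", cfg["clip"]["image_weights"])
print("T2E label source:    ", cfg["clip"]["pl_sources"])
```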
To launch the pretraining experiments, simply run the following command with your selected configuration file:
python train.py --settings_file ${SELECTED_CONFIG_PATH}
Currently, the code only supports single-GPU pretraining. If you would like to use multiple GPUs for pretraining, please adjust the code accordingly.
In our experimental settings, linear probing is used as a lightweight evaluation of the pretrained model weights: only a small set of linear layers is trainable while the model backbone remains frozen. This provides a quick validation of the quality of the representations learned during pretraining.
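For intuition, linear probing boils down to freezing the pretrained backbone and optimizing only a lightweight linear head on top of its features. The generic PyTorch sketch below illustrates the idea; it is not the repository's implementation, and the feature dimension and class count are placeholder values.

```python
# Generic PyTorch sketch of linear probing (not the repository's implementation).
import torch
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    for param in backbone.parameters():
        param.requires_grad = False                          # backbone stays frozen
    head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)   # trainable linear layer
    return nn.Sequential(backbone, head)

# Only the head's parameters are handed to the optimizer.
# model = build_linear_probe(pretrained_backbone, feat_dim=128, num_classes=11)
# optimizer = torch.optim.AdamW(model[-1].parameters(), lr=1e-3)
```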
To help you with this linear probing process, we have prepared ready-to-use configuration files (in .yaml format). You can find these files inside the config/linear_probe directory.
Similar to the pretraining stage, we include settings for both DSEC-Semantic and DDD17-Seg datasets, where the superpixels (constructed with either SAM or SLIC) and event representations have already been specified in the configuration files.
The main settings can be adjusted directly in the configuration files. Below, we describe the configurable options and their valid values in detail:
Configuration | Config Key | Valid Options |
---|---|---|
Log Directory | dir -> log | - |
Event Representation | clip -> config_option | {'frame2recon', 'frame2voxel', 'recon2voxel'} |
Superpixel Source | clip -> superpixel_sources | {'sp_sam_rgb', 'sp_slic_rgb'} |
Pretrained Weight | clip -> pre_trained_backbone | - |
To launch the linear probing experiments, simply run the following command with your selected configuration file:
python train.py --settings_file ${SELECTED_CONFIG_PATH}
In the model fine-tuning stage, we evaluate the performance of our model under different levels of supervision, i.e., varying the proportion of ground truth semantic labels used.
As usual, to help you with this fine-tuning process, we have prepared ready-to-use configuration files (in .yaml format). You can find these files inside the config/finetunes directory.
Similar to the pretraining and linear probing stages, we include settings for both DSEC-Semantic and DDD17-Seg datasets, where the superpixels (constructed with either SAM or SLIC) and event representations have already been specified in the configuration files.
The main settings can be adjusted directly in the configuration files. Below, we describe the configurable options, their valid values, and the label-ratio mapping in detail:
Configuration | Config Key | Valid Options | Mapping |
---|---|---|---|
Log Directory | dir -> log | - | - |
Event Representation | clip -> config_option | {'frame2recon', 'frame2voxel', 'recon2voxel'} | - |
Superpixel Source | clip -> superpixel_sources | {'sp_sam_rgb', 'sp_slic_rgb'} | - |
Pretrained Weight | clip -> pre_trained_backbone | - | - |
Label Ratio | clip -> skip_ratio | {'100', '20', '10', '5', '1'} | {'1': 100%, '5': 20%, '10': 10%, '20': 5%, '100': 1%} |
Training Epochs | clip -> num_epochs | - | - |
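Note that skip_ratio acts as a subsampling factor over the annotated frames, so a value of N corresponds to roughly 100/N percent of the labels, which yields the mapping above. The exact sampling strategy is defined in the code; the snippet below only illustrates the arithmetic.

```python
# Illustration of the skip_ratio -> label-percentage mapping shown above:
# keeping every N-th annotated frame retains 100/N percent of the labels.
for skip_ratio in (1, 5, 10, 20, 100):
    kept = range(0, 1000, skip_ratio)  # indices of the annotations that are kept
    print(f"skip_ratio={skip_ratio:<3d} -> {100 / skip_ratio:5.1f}% of labels ({len(kept)} of 1000)")
```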
To launch the fine-tuning experiments, simply run the following command with your selected configuration file:
python train.py --settings_file ${SELECTED_CONFIG_PATH}
Here, we provide several trained models and the training logs to help you better understand the experiments, as well as reproduce the results reported in the paper. These materials are open-sourced at this Google Drive link.
Additionally, to help you with the result visualizations, we also provide our processed color labels for both DSEC-Semantic and DDD17-Seg. The color maps follow those of the original papers. These materials are open-sourced at this Google Drive link (for DSEC-Semantic) and this Google Drive link (for DDD17-Seg).
Method | Venue | DDD17 Acc | DDD17 mIoU | DSEC Acc | DSEC mIoU |
---|---|---|---|---|---|
MaskCLIP | ECCV'22 | 81.29 | 31.90 | 58.96 | 21.97 |
FC-CLIP | NeurIPS'23 | 88.66 | 51.12 | 79.20 | 39.42 |
OpenESS | Ours | 90.51 | 53.93 | 86.18 | 43.31 |
Method | Venue | DDD17 Acc | DDD17 mIoU | DSEC Acc | DSEC mIoU |
---|---|---|---|---|---|
Ev-SegNet | CVPRW'19 | 89.76 | 54.81 | 88.61 | 51.76 |
E2VID | TPAMI'19 | 85.84 | 48.47 | 80.06 | 44.08 |
Vid2E | CVPR'20 | 90.19 | 56.01 | - | - |
EVDistill | CVPR'21 | - | 58.02 | - | - |
DTL | ICCV'21 | - | 58.80 | - | - |
PVT-FPN | ICCV'21 | 94.28 | 53.89 | - | - |
SpikingFCN | NCE'22 | - | 34.20 | - | - |
EV-Transfer | RA-L'22 | 51.90 | 15.52 | 63.00 | 24.37 |
ESS | ECCV'22 | 88.43 | 53.09 | 84.17 | 45.38 |
ESS-Sup | ECCV'22 | 91.08 | 61.37 | 89.37 | 53.29 |
P2T-FPN | TPAMI'23 | 94.57 | 54.64 | - | - |
EvSegformer | TIP'23 | 94.72 | 54.41 | - | - |
HMNet-B | CVPR'23 | - | - | 88.70 | 51.20 |
HMNet-L | CVPR'23 | - | - | 89.80 | 55.00 |
HALSIE | WACV'24 | 92.50 | 60.66 | 89.01 | 52.43 |
Method | Venue | DDD17 Acc | DDD17 mIoU | DSEC Acc | DSEC mIoU |
---|---|---|---|---|---|
MaskCLIP | ECCV'22 | 90.50 | 61.27 | 89.81 | 55.01 |
FC-CLIP | NeurIPS'23 | 90.68 | 62.01 | 89.97 | 55.67 |
OpenESS | Ours | 91.05 | 63.00 | 90.21 | 57.21 |
If you find this work helpful, please kindly consider citing our paper:
@inproceedings{kong2024openess,
title = {OpenESS: Event-Based Semantic Scene Understanding with Open Vocabularies},
author = {Lingdong Kong and Youquan Liu and Lai Xing Ng and Benoit R. Cottereau and Wei Tsang Ooi},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages = {15686-15698},
year = 2024,
}
This work is released under the Apache License, Version 2.0, while some specific implementations in this codebase may be covered by other licenses. Kindly refer to the original papers and repositories for a careful check if you intend to use our code for commercial purposes.
This work is part of the programme DesCartes and is supported by the National Research Foundation, Prime Minister’s Office, Singapore, under its Campus for Research Excellence and Technological Enterprise (CREATE) programme. ❤️