Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation (ICCV 2025)
🚨 This repository will contain download links to our evaluation code and trained models for our work "Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation", ICCV 2025
by Luca Bartolomei<sup>1,2</sup>, Enrico Mannocci<sup>2</sup>, Fabio Tosi<sup>2</sup>, Matteo Poggi<sup>1,2</sup>, and Stefano Mattoccia<sup>1,2</sup>
Advanced Research Center on Electronic Systems (ARCES)<sup>1</sup>, Department of Computer Science and Engineering (DISI)<sup>2</sup>
University of Bologna
Proposed Cross-Modal Distillation Strategy. During training, a VFM teacher processes RGB input frames to generate proxy depth labels, which supervise an event-based student model. The student takes aligned event stacks as input and predicts the final depth map.
Note: 🚧 This repository is currently under development. We are actively adding and refining features and documentation. We apologize for any inconvenience caused by incomplete or missing elements and appreciate your patience as we work towards completion.
Monocular depth perception from cameras is crucial for applications such as autonomous navigation and robotics. While conventional cameras have enabled impressive results, they struggle in highly dynamic scenes and challenging lighting conditions due to limitations like motion blur and low dynamic range. Event cameras, with their high temporal resolution and dynamic range, address these issues but provide sparse information and lack large annotated datasets, making depth estimation difficult.
This project introduces a novel approach to monocular depth estimation with event cameras by leveraging Vision Foundation Models (VFMs) trained on images. The method uses cross-modal distillation to transfer knowledge from image-based VFMs to event-based networks, utilizing spatially aligned data from devices like the DAVIS Camera. Additionally, the project adapts VFMs for event-based depth estimation, proposing both a direct adaptation and a new recurrent architecture. Experiments on synthetic and real datasets demonstrate competitive or state-of-the-art results without requiring expensive depth annotations.
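To make the paradigm above concrete, here is a minimal PyTorch-style sketch of a single distillation step. It is not the official training code: the `teacher`/`student` modules, the tensor shapes, and the affine-invariant (median/MAD-normalized) L1 loss are assumptions for illustration, reflecting the idea that proxy depth predicted by a frozen RGB VFM supervises an event-based student.

```python
import torch
import torch.nn.functional as F

def normalize_depth(d, eps=1e-6):
    """Map a depth map to zero median and unit mean absolute deviation,
    so teacher and student are compared up to an affine (scale/shift) ambiguity."""
    b = d.shape[0]
    d = d.reshape(b, -1)
    t = d.median(dim=1, keepdim=True).values
    s = (d - t).abs().mean(dim=1, keepdim=True).clamp_min(eps)
    return (d - t) / s

def distillation_step(teacher, student, rgb_frame, event_stack, optimizer):
    """One cross-modal distillation step (illustrative): the frozen RGB teacher
    produces proxy depth, which supervises the event-based student."""
    with torch.no_grad():
        proxy = teacher(rgb_frame)       # (B, 1, H, W) proxy depth from aligned RGB
    pred = student(event_stack)          # (B, 1, H, W) depth predicted from event stacks
    loss = F.l1_loss(normalize_depth(pred), normalize_depth(proxy))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```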
Contributions:

- A novel cross-modal distillation paradigm that leverages the robust proxy labels obtained from image-based VFMs for monocular depth estimation.
- An adaptation strategy to cast existing image-based VFMs into the event domain effortlessly.
- A novel recurrent architecture based on an adapted image-based VFM.
- Adapting VFMs to the event domain yields state-of-the-art performance, and our distillation paradigm is competitive against supervision from depth sensors.
🖋️ If you find this code useful in your research, please cite:
```bibtex
@InProceedings{Bartolomei_2025_ICCV,
    author    = {Bartolomei, Luca and Mannocci, Enrico and Tosi, Fabio and Poggi, Matteo and Mattoccia, Stefano},
    title     = {Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
}
```

Here, you will be able to download the weights of VFMs for the event domain.
You can download our pretrained models here.
The Test section contains scripts to evaluate depth estimation on the MVSEC and DSEC datasets. Please refer to that section for detailed instructions on setup and execution.
Warning:

- With the latest updates in PyTorch, slight variations in the quantitative results compared to the numbers reported in the paper may occur.
- Dependencies: Ensure that you have installed all the necessary dependencies. The list of dependencies can be found in the `./requirements.txt` file.
- Set script variables: Each script needs the path to the virtual environment (if any) and to the dataset. Please set those variables before running the script.
- Set config variables: Each JSON config file has a `datapath` key: update it according to your environment (see the example below).
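For reference, a minimal illustrative config could look like the snippet below. Only the `datapath` key is mentioned in this README; its value is a placeholder and any other keys depend on your setup and on the specific experiment.

```json
{
  "datapath": "/path/to/your/dataset/root"
}
```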
We used two datasets for evaluation: MVSEC and DSEC.
Download the processed version of MVSEC here. Thanks to the authors of E2DEPTH for the amazing work.
Unzip the archives, arranging them as shown in the data structure below:
```
MVSEC
├── test
│   ├── mvsec_dataset_day2
└── train
    ├── mvsec_outdoor_day1
    ├── mvsec_outdoor_night1
    ├── mvsec_outdoor_night2
    └── mvsec_outdoor_night3
```
Download Images, Events, Disparities, and Calibration Files from the official website.
Unzip the archives, then you will get a data structure as follows:
```
DSEC
└── train
    ├── interlaken_00_c
    ├── ...
    └── zurich_city_11_c
```
To reproduce the tables in our paper, use this snippet:

```bash
bash scripts/test.sh
```

You should change the variables inside the script before launching it.
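As a rough, hypothetical example (the actual variable names are defined inside `scripts/test.sh` and may differ), the edits usually amount to something like:

```bash
# Hypothetical excerpt: variable names and paths are placeholders;
# check scripts/test.sh for the actual ones expected by the script.
VENV=/path/to/your/venv        # virtual environment to activate (if any)
DATASET=/path/to/MVSEC         # root folder of the downloaded dataset
```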
For questions, please send an email to luca.bartolomei5@unibo.it
We would like to extend our sincere appreciation to the authors of the following projects for making their code available, which we have utilized in our work: