Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail (CVPR 2025)

Evaluation code released - MonoTrap released - training code coming soon


🚨 This repository will contain download links to our code and trained deep stereo models for our work "Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail", CVPR 2025.

by Luca Bartolomei<sup>1,2</sup>, Fabio Tosi<sup>2</sup>, Matteo Poggi<sup>1,2</sup>, and Stefano Mattoccia<sup>1,2</sup>

Advanced Research Center on Electronic Systems (ARCES)<sup>1</sup>, University of Bologna<sup>2</sup>


Project Page | Paper


Stereo Anywhere: Combining Monocular and Stereo Strengths for Robust Depth Estimation. Our model achieves accurate results in standard conditions (on Middlebury), while effectively handling non-Lambertian surfaces where stereo networks fail (on Booster) and perspective illusions that deceive monocular depth foundation models (on MonoTrap, our novel dataset).

Note: 🚧 Kindly note that this repository is currently in the development phase. We are actively working to add and refine features and documentation. We apologize for any inconvenience caused by incomplete or missing elements and appreciate your patience as we work towards completion.

📑 Table of Contents

  • 🎬 Introduction
  • 📥 Pretrained Models
  • 📝 Code
  • 🛠️ Setup Instructions
  • 💾 Datasets
  • 🚆 Training
  • 🚀 Test
  • 🎨 Qualitative Results
  • ✉️ Contacts
  • 🙏 Acknowledgements

🎬 Introduction

We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.

Contributions:

  • A novel deep stereo architecture leveraging monocular depth VFMs to achieve strong generalization capabilities and robustness to challenging conditions.

  • Novel data augmentation strategies designed to enhance the robustness of our model to textureless regions and non-Lambertian surfaces.

  • A challenging dataset featuring optical illusions, which are particularly difficult for monocular depth VFMs.

  • Extensive experiments showing Stereo Anywhere's superior generalization and robustness to conditions critical for either stereo or monocular approaches.

🖋️ If you find this code useful in your research, please cite:

@inproceedings{bartolomei2024stereo,
  title={Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail},
  author={Bartolomei, Luca and Tosi, Fabio and Poggi, Matteo and Mattoccia, Stefano},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}

📥 Pretrained Models

Here, you will be able to download the weights of our proposal trained on SceneFlow.

You can download our pretrained models here.

📝 Code

The Training section provides a script to train our model on the SceneFlow dataset, while the Test section contains scripts to evaluate disparity estimation on datasets such as KITTI, Middlebury, and ETH3D.

Please refer to each section for detailed instructions on setup and execution.

Warning:

  • With the latest updates in PyTorch, slight variations in the quantitative results compared to the numbers reported in the paper may occur.

🛠️ Setup Instructions

  1. Dependencies: Ensure that you have installed all the necessary dependencies. The list of dependencies can be found in the ./requirements.txt file (a minimal setup sketch is shown below).
  2. Set script variables: Each script needs the path to the virtual environment (if any) and to the dataset. Please set these variables before running the script.
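For reference, here is a minimal environment setup sketch (the virtual environment name .venv below is just an example, not something required by our scripts):

$ python -m venv .venv            # example environment name
$ source .venv/bin/activate
$ pip install -r requirements.txt

Each script can then be pointed to this virtual environment and to the dataset folders described in the next section.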

💾 Datasets

We used the SceneFlow dataset for training and eight datasets for evaluation.

Specifically, we evaluate our proposal and competitors using:

SceneFlow - FlyingThings (subset)

Download Images and Disparities from the official website.

Unzip the archives, then you will get a data structure as follows:

FlyingThings3D_subset
├── val
└── train
    ├── disparity
    │   ├── left
    │   └── right
    └── image_clean
        ├── left
        └── right

SceneFlow - Monkaa

Download Images and Disparities from the official website.

Unzip the archives, then you will get a data structure as follows:

Monkaa
├── disparity
└── frames_cleanpass

SceneFlow - Driving

Download Images and Disparities from the official website.

Unzip the archives, then you will get a data structure similar to the data structure of Monkaa.

Middlebury 2014 - MiddEval3 (Half resolution)

Download Images, Left GT, Right GT from Middlebury Website, then unzip the packages.

After that, you will get a data structure as follows:

MiddEval3
├── trainingH
│   ├── Adirondack
│   │   ├── im0.png
│   │   ├── im1.png
│   │   ├── mask0nocc.png
│   │   └── disp0GT.pfm
│   ├── ...
│   └── Vintage
└── testH

Middlebury 2021

Download the Middlebury 2021 Archive from Middlebury Website. Then download our occlusion masks obtained using LRC. After that, unzip all archives.

You will get a data structure similar to MiddEval3.

ETH3D

You can download the ETH3D dataset using the following script:

$ cd PATH_TO_DOWNLOAD
$ wget https://www.eth3d.net/data/two_view_training.7z
$ wget https://www.eth3d.net/data/two_view_training_gt.7z
$ p7zip -d *.7z

After that, you will get a data structure as follows:

eth3d
├── delivery_area_1l
│    ├── im0.png
│    └── ...
...
└── terrains_2s
     └── ...

Note that the script deletes the 7z archives after extraction. Further details are available at the official website.

KITTI 2012

Go to the official KITTI 2012 website; with a registered account, you will be able to download the stereo 2012 dataset.

After that, you need to add some symbolic links:

cd KITTI2012_PATH
cd training
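# expose the KITTI 2012 folders under the KITTI 2015 naming scheme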

ln -s colored_0 image_2
ln -s colored_1 image_3
ln -s disp_noc disp_noc_0
ln -s disp_occ disp_occ_0

You will get a data structure similar to KITTI 2015.

KITTI 2015

Go to the official KITTI 2015 website; with a registered account, you will be able to download the stereo 2015 dataset.

After that, you will get a data structure as follows:

kitti2015
└── training
    ├── disp_occ_0
    │    ├── 000000_10.png
    │    ...
    │    └── 000199_10.png
    ├── disp_noc_0
    ├── image_2
    └── image_3

Booster

You can download the Booster dataset from AMSActa (Booster Dataset Labeled - 19GB). Please refer to the official website for further details. After that, unzip the archive to your preferred folder.

You will get a data structure as follows:

Booster
├── test
└── train
    ├── unbalanced
    └── balanced
         ├── Bathroom
         ...
         └── Washer

LayeredFlow

You can download the LayeredFlow dataset from the official website.

Unzip the archive, then you will get a data structure as follows:

public_layeredflow_benchmark
├── calib
├── test
└── val
    ├── 0
    ...
    └── 199

MonoTrap

You can download our MonoTrap dataset from our drive.

Unzip the archive, then you will get a data structure as follows:

MonoTrap
└── validation
    ├── RealTrap
    └── CraftedTrap

🚆 Training

We will provide further information on how to train Stereo Anywhere soon.

🚀 Test

To evaluate Stereo Anywhere on all datasets except MonoTrap, use this snippet:

python test.py --datapath <DATAPATH> --dataset <DATASET> \
--stereomodel stereoanywhere --loadstereomodel <STEREO_MODEL_PATH> \
--monomodel DAv2 --loadmonomodel <MONO_MODEL_PATH> \
--iscale <ISCALE> --oscale <OSCALE> --normalize --iters 32 \
--vol_n_masks 8 --n_additional_hourglass 0 \
--use_aggregate_mono_vol --vol_downsample 0 \
--mirror_conf_th 0.98 --use_truncate_vol --mirror_attenuation 0.9

where:

  • DATAPATH is the path to the dataset;
  • DATASET is the name of the dataset (i.e., middlebury, middlebury2021, eth3d, kitti2012, kitti2015, booster, layeredflow);
  • STEREO_MODEL_PATH is the path to our pretrained SceneFlow checkpoint;
  • MONO_MODEL_PATH is the path to the DAv2-Large pretrained monocular model;
  • ISCALE is the resolution of the input images (use 4 for Booster, 8 for LayeredFlow, 1 for the others);
  • OSCALE is the resolution used for evaluation (use 4 for Booster, 8 for LayeredFlow, 1 for the others).
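For example, a full-resolution evaluation on Middlebury (MiddEval3) could look like the following sketch; the dataset path and checkpoint file names are placeholders for your own local copies:

# placeholder paths: replace with your dataset folder and downloaded checkpoints
python test.py --datapath ./datasets/MiddEval3 --dataset middlebury \
--stereomodel stereoanywhere --loadstereomodel ./weights/stereoanywhere_sceneflow.tar \
--monomodel DAv2 --loadmonomodel ./weights/depth_anything_v2_vitl.pth \
--iscale 1 --oscale 1 --normalize --iters 32 \
--vol_n_masks 8 --n_additional_hourglass 0 \
--use_aggregate_mono_vol --vol_downsample 0 \
--mirror_conf_th 0.98 --use_truncate_vol --mirror_attenuation 0.9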

To evaluate Stereo Anywhere on our MonoTrap dataset, use this snippet:

python test_monotrap.py --datapath <DATAPATH> \
--stereomodel stereoanywhere --loadstereomodel <STEREO_MODEL_PATH> \
--monomodel DAv2 --loadmonomodel <MONO_MODEL_PATH> \
--iscale <ISCALE> --oscale <OSCALE> --normalize --iters 32 \
--vol_n_masks 8 --n_additional_hourglass 0 \
--use_aggregate_mono_vol --vol_downsample 0 \
--mirror_conf_th 0.98 --use_truncate_vol --mirror_attenuation 0.9

🎨 Qualitative Results

In this section, we present illustrative examples that demonstrate the effectiveness of our proposal.


Qualitative results -- Zero-Shot Generalization. Predictions by state-of-the-art models and Stereo Anywhere. In particular, the first row shows an extremely challenging case for SceneFlow-trained models, where Stereo Anywhere achieves accurate disparity maps thanks to VFM priors.


Qualitative results -- Zero-Shot non-Lambertian Generalization. Predictions by state-of-the-art models and Stereo Anywhere. Our proposal is the only stereo model that correctly perceives the mirror and the transparent railing.


Qualitative results -- MonoTrap. The figure shows three samples where Depth Anything v2 fails while Stereo Anywhere does not.

✉️ Contacts

For questions, please send an email to luca.bartolomei5@unibo.it

🙏 Acknowledgements

We would like to extend our sincere appreciation to the authors of the following projects for making their code available, which we have utilized in our work:

  • The authors of RAFT-Stereo for providing their code, which has been inspirational for our stereo-matching architecture.
  • The authors of Depth Anything V2 for providing their incredible monocular depth estimation network, which fuels our Stereo Anywhere proposal.