Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail (CVPR 2025)
🚨 This repository will contain download links to our code and trained deep stereo models for our work "Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail", CVPR 2025
by Luca Bartolomei¹,², Fabio Tosi², Matteo Poggi¹,², and Stefano Mattoccia¹,²
¹ Advanced Research Center on Electronic Systems (ARCES), ² University of Bologna

Stereo Anywhere: Combining Monocular and Stereo Strengths for Robust Depth Estimation. Our model achieves accurate results in standard conditions (on Middlebury), while effectively handling non-Lambertian surfaces where stereo networks fail (on Booster) and perspective illusions that deceive monocular depth foundation models (on MonoTrap, our novel dataset).
Note: 🚧 Kindly note that this repository is currently in the development phase. We are actively working to add and refine features and documentation. We apologize for any inconvenience caused by incomplete or missing elements and appreciate your patience as we work towards completion.
We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.
Contributions:
- A novel deep stereo architecture leveraging monocular depth VFMs to achieve strong generalization capabilities and robustness to challenging conditions.
- Novel data augmentation strategies designed to enhance the robustness of our model to textureless regions and non-Lambertian surfaces.
- A challenging dataset featuring optical illusions, which are particularly difficult for monocular depth VFMs.
- Extensive experiments showing Stereo Anywhere's superior generalization and robustness to conditions critical for either stereo or monocular approaches.
🖋️ If you find this code useful in your research, please cite:
@inproceedings{bartolomei2024stereo,
title={Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail},
author={Bartolomei, Luca and Tosi, Fabio and Poggi, Matteo and Mattoccia, Stefano},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}
Here, you will be able to download the weights of our model trained on SceneFlow.
You can download our pretrained models here.
The Training section provides a script to train our model using the SceneFlow dataset, while our Test section contains scripts to evaluate disparity estimation on datasets such as KITTI, Middlebury, and ETH3D.
Please refer to each section for detailed instructions on setup and execution.
Warning:
- With the latest updates in PyTorch, slight variations in the quantitative results compared to the numbers reported in the paper may occur.
- Dependencies: ensure that you have installed all the necessary dependencies. The list of dependencies can be found in the ./requirements.txt file.
- Set script variables: each script needs the path to the virtual environment (if any) and to the dataset. Please set those variables before running the script (see the example setup below).
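For reference, a minimal setup might look like the following sketch. The virtual-environment name, dataset path, and variable names below are illustrative placeholders, not values defined by this repository:
python3 -m venv .venv                      # create a virtual environment (optional)
source .venv/bin/activate                  # activate it
pip install -r requirements.txt            # install the dependencies listed in requirements.txt
VENV_PATH=$(pwd)/.venv                     # hypothetical variable: path to the virtual environment
DATASET_PATH=/path/to/datasets/middlebury  # hypothetical variable: path to a downloaded dataset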
We used the SceneFlow dataset for training and eight datasets for evaluation.
Specifically, we evaluate our proposal and competitors using:
- five indoor/outdoor datasets: Middlebury 2014, Middlebury 2021, ETH3D, KITTI 2012, and KITTI 2015;
- two datasets containing non-Lambertian surfaces: Booster and LayeredFlow;
- and finally MonoTrap, our novel stereo dataset specifically designed to challenge monocular depth estimation.
Download the FlyingThings3D Images and Disparities from the official website.
Unzip the archives, then you will get a data structure as follows:
FlyingThings3D_subset
├── val
└── train
├── disparity
│ ├── left
│ └── right
└── image_clean
├── left
└── right
Download the Monkaa Images and Disparities from the official website.
Unzip the archives, then you will get a data structure as follows:
Monkaa
├── disparity
└── frames_cleanpass
Download the Driving Images and Disparities from the official website.
Unzip the archives, then you will get a data structure similar to that of Monkaa.
Download Images, Left GT, and Right GT from the Middlebury website, then unzip the packages.
After that, you will get a data structure as follows:
MiddEval3
├── trainingH
│ ├── Adirondack
│ │ ├── im0.png
│ │ ├── im1.png
│ │ ├── mask0nocc.png
│ │ └── disp0GT.pfm
│ ├── ...
│ └── Vintage
└── testH
Download the Middlebury 2021 Archive from the Middlebury website. Then download our occlusion masks obtained using LRC (left-right consistency). After that, unzip all archives.
You will get a data structure similar to MiddEval3.
You can download the ETH3D dataset using the following script:
$ cd PATH_TO_DOWNLOAD
$ wget https://www.eth3d.net/data/two_view_training.7z
$ wget https://www.eth3d.net/data/two_view_training_gt.7z
$ p7zip -d *.7z
After that, you will get a data structure as follows:
eth3d
├── delivery_area_1l
│ ├── im0.png
│ └── ...
...
└── terrains_2s
└── ...
Note that the script deletes the downloaded 7z archives after extraction. Further details are available at the official website.
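Alternatively, if you prefer to keep the downloaded archives, you can extract them with the 7z executable (provided, e.g., by the p7zip-full package) instead of p7zip -d:
$ cd PATH_TO_DOWNLOAD
$ 7z x two_view_training.7z
$ 7z x two_view_training_gt.7z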
Go to the official KITTI 2012 website; with a registered account, you will be able to download the stereo 2012 dataset.
After that, you need to add some symbolic links:
cd KITTI2012_PATH
cd training
ln -s colored_0 image_2
ln -s colored_1 image_3
ln -s disp_noc disp_noc_0
ln -s disp_occ disp_occ_0
You will get a data structure similar to KITTI 2015.
Go to the official KITTI 2015 website; with a registered account, you will be able to download the stereo 2015 dataset.
After that, you will get a data structure as follows:
kitti2015
└── training
├── disp_occ_0
│ ├── 000000_10.png
│ ...
│ └── 000199_10.png
├── disp_noc_0
├── image_2
└── image_3
You can download the Booster dataset from AMSActa (Booster Dataset Labeled - 19GB). Please refer to the official website for further details. After that, unzip the archive to your preferred folder.
You will get a data structure as follows:
Booster
├── test
└── train
├── unbalanced
└── balanced
├── Bathroom
...
└── Washer
You can download the LayeredFlow dataset from the official website.
Unzip the archive, then you will get a data structure as follows:
public_layeredflow_benchmark
├── calib
├── test
└── val
├── 0
...
└── 199
You can download our MonoTrap dataset from our drive.
Unzip the archive, then you will get a data structure as follows:
MonoTrap
└── validation
├── RealTrap
└── CraftedTrap
We will provide further information on how to train Stereo Anywhere soon.
To evaluate Stereo Anywhere on all datasets except MonoTrap, use this snippet:
python test.py --datapath <DATAPATH> --dataset <DATASET> \
--stereomodel stereoanywhere --loadstereomodel <STEREO_MODEL_PATH> \
--monomodel DAv2 --loadmonomodel <MONO_MODEL_PATH> \
--iscale <ISCALE> --oscale <OSCALE> --normalize --iters 32 \
--vol_n_masks 8 --n_additional_hourglass 0 \
--use_aggregate_mono_vol --vol_downsample 0 \
--mirror_conf_th 0.98 --use_truncate_vol --mirror_attenuation 0.9
where DATAPATH is the path to the dataset, DATASET is the name of the dataset (i.e., middlebury, middlebury2021, eth3d, kitti2012, kitti2015, booster, layeredflow), STEREO_MODEL_PATH is the path to our pretrained SceneFlow checkpoint, MONO_MODEL_PATH is the path to the DAv2-Large pretrained monocular model, ISCALE is the resolution of the input images (use 4 for Booster, 8 for LayeredFlow, and 1 for the others), and OSCALE is the resolution used for evaluation (use 4 for Booster, 8 for LayeredFlow, and 1 for the others).
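As a concrete example, an evaluation run on Booster at quarter resolution could look like the snippet below; the dataset and checkpoint paths are placeholders to be replaced with your own:
python test.py --datapath /path/to/Booster --dataset booster \
--stereomodel stereoanywhere --loadstereomodel ./weights/stereoanywhere_sceneflow.tar \
--monomodel DAv2 --loadmonomodel ./weights/depth_anything_v2_vitl.pth \
--iscale 4 --oscale 4 --normalize --iters 32 \
--vol_n_masks 8 --n_additional_hourglass 0 \
--use_aggregate_mono_vol --vol_downsample 0 \
--mirror_conf_th 0.98 --use_truncate_vol --mirror_attenuation 0.9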
To evaluate Stereo Anywhere on our MonoTrap dataset, use this snippet:
python test_monotrap.py --datapath <DATAPATH> \
--stereomodel stereoanywhere --loadstereomodel <STEREO_MODEL_PATH> \
--monomodel DAv2 --loadmonomodel <MONO_MODEL_PATH> \
--iscale <ISCALE> --oscale <OSCALE> --normalize --iters 32 \
--vol_n_masks 8 --n_additional_hourglass 0 \
--use_aggregate_mono_vol --vol_downsample 0 \
--mirror_conf_th 0.98 --use_truncate_vol --mirror_attenuation 0.9
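For instance, with placeholder paths and assuming full-resolution evaluation (the scale factors for MonoTrap are not restated above, so --iscale 1 --oscale 1 is only an assumption here):
python test_monotrap.py --datapath /path/to/MonoTrap \
--stereomodel stereoanywhere --loadstereomodel ./weights/stereoanywhere_sceneflow.tar \
--monomodel DAv2 --loadmonomodel ./weights/depth_anything_v2_vitl.pth \
--iscale 1 --oscale 1 --normalize --iters 32 \
--vol_n_masks 8 --n_additional_hourglass 0 \
--use_aggregate_mono_vol --vol_downsample 0 \
--mirror_conf_th 0.98 --use_truncate_vol --mirror_attenuation 0.9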
In this section, we present illustrative examples that demonstrate the effectiveness of our proposal.
Qualitative Results -- Zero-Shot Generalization. Predictions by state-of-the-art models and Stereo Anywhere. In particular, the first row shows an extremely challenging case for SceneFlow-trained models, where Stereo Anywhere achieves accurate disparity maps thanks to VFM priors.
Qualitative results -- Zero-Shot non-Lambertian Generalization. Predictions by state-of-the-art models and Stereo Anywhere. Our proposal is the only stereo model correctly perceiving the mirror and transparent railing.
Qualitative results -- MonoTrap. The figure shows three samples where Depth Anything v2 fails while Stereo Anywhere does not.
For questions, please send an email to luca.bartolomei5@unibo.it
We would like to extend our sincere appreciation to the authors of the following projects for making their code available, which we have utilized in our work:
- We would like to thank the authors of RAFT-Stereo for providing their code, which has been inspirational for our stereo matching architecture.
- We would also like to thank the authors of Depth Anything V2 for providing their incredible monocular depth estimation network, which fuels our Stereo Anywhere proposal.