PRISMA is a computational photography pipeline that performs multiple inferences (referred to as "bands") on any image or video. Like light passing through a prism that splits it into different wavelengths, this pipeline expands images into data that can be used for 3D reconstruction or real-time post-processing operations.
It combines different algorithms and open-source pre-trained models, such as:
- Monocular depth (MiDAS v3.1, ZoeDepth, Marigold, PatchFusion, Depth_Anything)
- Optical flow (RAFT, GMFlow)
- Segmentation mask (mmdet)
- Camera pose (COLMAP)
The resulting bands are stored in a folder with the same name as the input file. Each band is stored as a single .png or .mp4 file, and can be imported into:
- Estimated depth can be imported into Blender projects using this Blender project; COLMAP scenes can also be imported using this addon
- GlslViewer for applying real-time shaders
- Videos can be used for training both NeRFs (like NVidia's Instant-ngp) and Gaussian Splatting.
Notes:
- Inferred depth is exported by default as a heatmap that can be decoded in real time using LYGIA's heatmap GLSL/HLSL sampling.
- Optical flow is encoded as hue (angle) and saturation, which can also be decoded in real time using LYGIA's opticalFlow GLSL/HLSL sampler.
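To make the hue/saturation idea concrete, here is a minimal Python sketch of such an encoding and its inverse. It is not PRISMA's or LYGIA's actual code: the function names, the `max_magnitude` scale, and the choice to store the normalized magnitude in saturation with the value channel fixed at 1.0 are all assumptions made for illustration.

```python
import colorsys
import math

def encode_flow(dx, dy, max_magnitude):
    """Pack a 2D flow vector into an RGB triple (hypothetical helper).

    Assumption: hue stores the direction, saturation stores the magnitude
    divided by max_magnitude, and value is fixed at 1.0.
    """
    angle = math.atan2(dy, dx) % (2.0 * math.pi)   # direction in [0, 2*pi)
    hue = angle / (2.0 * math.pi)                  # normalized to [0, 1)
    saturation = min(math.hypot(dx, dy) / max_magnitude, 1.0)
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)

def decode_flow(r, g, b, max_magnitude):
    """Recover the (dx, dy) vector from an RGB triple (hypothetical helper)."""
    hue, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
    angle = hue * 2.0 * math.pi
    magnitude = saturation * max_magnitude
    return magnitude * math.cos(angle), magnitude * math.sin(angle)

# Round trip on float colors; 8-bit images would add quantization error.
rgb = encode_flow(3.0, -4.0, max_magnitude=10.0)
dx, dy = decode_flow(*rgb, max_magnitude=10.0)
```

A zero vector maps to saturation 0 (white), so its direction is lost but its magnitude decodes back to zero.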
Main dependencies:
git clone git@github.com:patriciogonzalezvivo/prisma.git
cd prisma
conda env create -f environment.yml
conda activate prisma
sh download_models.sh
# Install mmcv (for mmdetection)
pip install -U openmim
mim install mmengine
mim install "mmcv-full==1.7.1"
Let's start by processing an image:
python process.py -i data/gog.jpg
Without an --output argument, this will create a folder with the same name as the input file, which will contain all the derived bands (rgba, flow, mask and depth_*).
gog.jpg
gog/
├── depth_patchfusion.png
├── mask.png
├── metadata.json
└── rgba.png
In the folder you will find a metadata.json file that contains all the metadata associated with the original image or video.
{
    "bands": {
        "rgba": {
            "url": "rgba.png"
        },
        "depth_patchfusion": {
            "url": "depth_patchfusion.png",
            "values": {
                "min": {
                    "value": 1.6147574186325073,
                    "type": "float"
                },
                "max": {
                    "value": 11.678544044494629,
                    "type": "float"
                }
            }
        },
        "mask": {
            "url": "mask.png",
            "ids": [
                "person",
                "bird",
                "cat",
                "dog",
                "horse",
                "sheep",
                "cow",
                "elephant",
                "bear",
                "zebra",
                "giraffe"
            ]
        }
    },
    "width": 934,
    "height": 440,
    "principal_point": [
        467.0,
        220.0
    ],
    "focal_length": 641.0616195031489,
    "field_of_view": 37.88246641919117
}
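The camera fields in this metadata are numerically consistent with a simple pinhole model in which focal_length is expressed in pixels, principal_point is the image center, and field_of_view is the vertical angle in degrees. That interpretation is inferred from the values above, not stated by the project. Under that assumption, the following Python sketch recomputes the field of view and unprojects a pixel to a 3D point; both helper names are hypothetical.

```python
import json
import math

# Trimmed copy of the metadata.json shown above (camera fields only).
metadata = json.loads("""
{
    "width": 934,
    "height": 440,
    "principal_point": [467.0, 220.0],
    "focal_length": 641.0616195031489,
    "field_of_view": 37.88246641919117
}
""")

def vertical_fov_degrees(focal_length, height):
    """Vertical field of view of a pinhole camera, focal length in pixels."""
    return math.degrees(2.0 * math.atan(0.5 * height / focal_length))

def unproject(u, v, depth, focal_length, principal_point):
    """Map pixel (u, v) at a given depth to a camera-space point (x, y, z)."""
    cx, cy = principal_point
    x = (u - cx) * depth / focal_length
    y = (v - cy) * depth / focal_length
    return (x, y, depth)

fov = vertical_fov_degrees(metadata["focal_length"], metadata["height"])
center = unproject(467.0, 220.0, 5.0,
                   metadata["focal_length"], metadata["principal_point"])
```

Here `fov` agrees with the stored field_of_view to within rounding, and the pixel at the principal point lands on the optical axis.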
PRISMA currently supports multiple depth estimation algorithms. You can select which one to use by providing the --depth | -d argument: depth_midas, depth_zoedepth, depth_patchfusion, depth_marigold or all. By default, images are processed using depth_patchfusion, while videos use depth_anything.
When processing videos, PRISMA by default creates the least amount of data, storing a single .png or .mp4 for each band. For videos, data such as min/max values is stored in .csv files.
It's possible to save extra data by setting the --extra | -e level number:
- store bands as a single .png and .mp4 (videos usually have an associated .csv file)
- store images as .ply point clouds; for videos, extract the resulting frames as .png
- store optical flow from videos as .flo files
- store inferred depth as .npy files
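The .flo extension is commonly associated with the Middlebury optical flow format: a float32 magic number (202021.25), the width and height as int32, then interleaved float32 u, v values in row-major order, all little-endian. Whether PRISMA's .flo files follow exactly this layout is an assumption. With that caveat, a minimal reader and writer in plain Python might look like this (`read_flo` and `write_flo` are hypothetical helper names):

```python
import struct

FLO_MAGIC = 202021.25  # sanity-check value at the start of a Middlebury .flo file

def write_flo(path, flow):
    """Write flow, given as rows of (u, v) tuples, in Middlebury .flo layout."""
    height, width = len(flow), len(flow[0])
    with open(path, "wb") as f:
        f.write(struct.pack("<fii", FLO_MAGIC, width, height))
        for row in flow:
            for u, v in row:
                f.write(struct.pack("<ff", u, v))

def read_flo(path):
    """Read a Middlebury .flo file into rows of (u, v) tuples."""
    with open(path, "rb") as f:
        magic, width, height = struct.unpack("<fii", f.read(12))
        if magic != FLO_MAGIC:
            raise ValueError(f"{path} is not a Middlebury .flo file")
        count = 2 * width * height
        values = struct.unpack(f"<{count}f", f.read(4 * count))
    # Regroup the flat u, v, u, v, ... stream into rows of (u, v) pairs.
    pairs = list(zip(values[0::2], values[1::2]))
    return [pairs[y * width:(y + 1) * width] for y in range(height)]
```

For full-resolution frames, a NumPy-based reader would be much faster; the pure-Python version is only meant to show the byte layout.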
Now let's try extracting all depth models and individual frames from a video:
python process.py -i data/rocky.mp4 -d all -e 1
This produces the following folder structure:
rocky.mp4
rocky/
├── depth_anything/
│ ├── 000000.png
│ ├── 000001.png
│ ├── ...
│ └── 000110.png
├── depth_anything_max.csv
├── depth_anything_min.csv
├── depth_anything.mp4
├── depth_marigold/
│ ├── 000000.png
│ ├── 000001.png
│ ├── ...
│ └── 000110.png
├── depth_marigold_max.csv
├── depth_marigold_min.csv
├── depth_marigold.mp4
├── depth_midas/
│ ├── 000000.png
│ ├── 000001.png
│ ├── ...
│ └── 000110.png
├── depth_midas_max.csv
├── depth_midas_min.csv
├── depth_midas.mp4
├── depth_patchfusion/
│ ├── 000000.png
│ ├── 000001.png
│ ├── ...
│ └── 000110.png
├── depth_patchfusion_max.csv
├── depth_patchfusion_min.csv
├── depth_patchfusion.mp4
├── depth_zoedepth/
│ ├── 000000.png
│ ├── 000001.png
│ ├── ...
│ └── 000110.png
├── depth_zoedepth_max.csv
├── depth_zoedepth_min.csv
├── depth_zoedepth.mp4
├── flow_raft/
│ ├── 000000.png
│ ├── 000001.png
│ ├── ...
│ └── 000110.png
├── flow_raft.csv
├── flow_raft.mp4
├── flow_gmflow/
│ ├── 000000.png
│ ├── 000001.png
│ ├── ...
│ └── 000110.png
├── flow_gmflow.csv
├── flow_gmflow.mp4
├── images/
│ ├── 000000.png
│ ├── 000001.png
│ ├── ...
│ └── 000110.png
├── mask/
│ ├── 000000.png
│ ├── 000001.png
│ ├── ...
│ └── 000110.png
├── mask.mp4
├── sparse/
│ └── 0/
│   ├── cameras.bin
│   ├── images.bin
│   ├── points3D.bin
│   └── points3D.txt
├── camera_pose.csv
├── colmap.db
├── metadata.json
└── rgba.mp4
View the resulting bands from the processed image/video using ReRun:
```bash
python view.py -i data/rocky
```
To export the bands as a single image or video, you can use the concat.py script:
python concat.py -i data/gog -o test.png
This pipeline is Copyright (c) 2024, Patricio Gonzalez Vivo, and licensed under CC BY-NC-SA 4.0. Please reach out to patriciogonzalezvivo at gmail dot com to get a commercial license.
All the models and software it uses come with commercial-ready licenses such as MIT, Apache and BSD.
Paper: Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
License: MIT
Code Repo: isl-org/MiDaS
Use:
depth_midas.py --input <IMAGE/VIDEO> --output <IMAGE/VIDEO>
Citation:
@ARTICLE {Ranftl2022,
author = "Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun",
title = "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer",
journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
year = "2022",
volume = "44",
number = "3"
}
Citation for DPT-based model:
@article{Ranftl2021,
author = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
title = {Vision Transformers for Dense Prediction},
journal = {ArXiv preprint},
year = {2021},
}
Paper: Zero-shot Transfer by Combining Relative and Metric Depth
License: MIT
Code Repo: isl-org/ZoeDepth
Use:
depth_zoedepth.py --input <IMAGE/VIDEO> --output <IMAGE/VIDEO>
Citation:
@misc{https://doi.org/10.48550/arxiv.2302.12288,
doi = {10.48550/ARXIV.2302.12288},
url = {https://arxiv.org/abs/2302.12288},
author = {Bhat, Shariq Farooq and Birkl, Reiner and Wofk, Diana and Wonka, Peter and Müller, Matthias},
keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth},
publisher = {arXiv},
year = {2023},
copyright = {arXiv.org perpetual, non-exclusive license}
}
License: MIT
Code Repo: zhyever/PatchFusion
Use:
depth_patchfusion.py --input <IMAGE/VIDEO> --output <IMAGE/VIDEO>
Note: This pretrained model needs to be downloaded and placed in the models/ folder.
Citation:
@article{li2023patchfusion,
title={PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation},
author={Zhenyu Li and Shariq Farooq Bhat and Peter Wonka},
year={2023},
eprint={2312.02284},
archivePrefix={arXiv},
primaryClass={cs.CV}}
Paper: Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
License: Apache
Code Repo: prs-eth/Marigold
Use:
depth_marigold.py --input <IMAGE/VIDEO> --output <IMAGE/VIDEO>
Citation:
@misc{ke2023repurposing,
title={Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation},
author={Bingxin Ke and Anton Obukhov and Shengyu Huang and Nando Metzger and Rodrigo Caye Daudt and Konrad Schindler},
year={2023},
eprint={2312.02145},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Paper: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
License: Apache
Code Repo: LiheYoung/Depth-Anything
Use:
depth_anything.py --input <IMAGE/VIDEO> --output <IMAGE/VIDEO>
Citation:
@article{depthanything,
title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data},
author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
journal={arXiv:2401.10891},
year={2024}
}
Based on https://github.com/SharifElfouly/opical-flow-estimation-with-RAFT
Seems to be very good: Optical Flow Estimation Benchmark
Paper: RAFT: Recurrent All Pairs Field Transforms for Optical Flow
License: BSD
Code Repo: princeton-vl/RAFT
Use:
flow_raft.py --input <IMAGE/VIDEO> --output <IMAGE/VIDEO>
Paper: GMFlow: Learning Optical Flow via Global Matching
License: Apache
Code Repo: haofeixu/gmflow
Use:
flow_gmflow.py --input <IMAGE/VIDEO> --output <IMAGE/VIDEO>
Code Repo: MMDetection
License: Apache
Use:
mask_mmdet.py --input <IMAGE/VIDEO> --output <IMAGE/VIDEO>
Citation:
@article{mmdetection,
title = {{MMDetection}: Open MMLab Detection Toolbox and Benchmark},
author = {Chen, Kai and Wang, Jiaqi and Pang, Jiangmiao and Cao, Yuhang and
Xiong, Yu and Li, Xiaoxiao and Sun, Shuyang and Feng, Wansen and
Liu, Ziwei and Xu, Jiarui and Zhang, Zheng and Cheng, Dazhi and
Zhu, Chenchen and Cheng, Tianheng and Zhao, Qijie and Li, Buyu and
Lu, Xin and Zhu, Rui and Wu, Yue and Dai, Jifeng and Wang, Jingdong
and Shi, Jianping and Ouyang, Wanli and Loy, Chen Change and Lin, Dahua},
journal= {arXiv preprint arXiv:1906.07155},
year={2019}
}