Stereo Depth Perception Transforms #6495

Open

YosuaMichael opened this issue Aug 25, 2022 · 1 comment

@YosuaMichael (Contributor) commented Aug 25, 2022:

We are currently revamping the transforms API for the torchvision library. During this revamp, we are considering the components and behavior needed for new use cases such as detection, segmentation, and video classification. In this document, we want to raise awareness of several quirks of depth perception transforms that behave differently from the other use cases.

Brief Introduction to Stereo Depth Perception

In the stereo depth perception problem, we are given two images, left and right, and for each pixel in the left image we want to estimate the horizontal displacement of the corresponding pixel in the right image. The images are taken at the same time with two different cameras that share the same upright orientation, and the baseline between the cameras must be parallel to the horizontal axis / ground. The desired output is called a disparity map: a 2d integer array with the same size as the image, where each value indicates how many pixels the pixel at that position is displaced from the left image to the right image (the unit of displacement is pixels).

Stereo depth perception datasets sometimes also provide a valid mask. This is a 2d boolean array with the same size as the disparity map; it indicates whether each value in the disparity map is valid. In practice, we usually exclude the invalid disparity values from the loss calculation during training or validation, for example as shown below.
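As an illustration, here is a minimal sketch of how a valid mask is typically used to exclude invalid disparities from an L1 loss. The names `pred_disparity`, `gt_disparity`, `valid_mask`, and `masked_l1_loss` are hypothetical, not from any specific codebase:

```python
import torch

def masked_l1_loss(pred_disparity, gt_disparity, valid_mask):
    # Only pixels where the mask is True contribute to the loss;
    # invalid disparity values are dropped before averaging.
    diff = (pred_disparity - gt_disparity).abs()
    return diff[valid_mask].mean()
```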

Components

  1. Images:
    A pair of left and right images as input.
  2. Disparity map:
    A 2d integer array with the same size as the image, indicating the displacement of pixels between the image pair. Some datasets only give one disparity map, which usually indicates the displacement from the left to the right image, while other datasets provide a pair of left and right disparity maps.
  3. Valid mask:
    A 2d boolean array with the same size as the disparity map, indicating whether each disparity value is valid and should be considered in the loss calculation.

In general, a stereo depth perception dataset will have:

(
  (image_left, image_right),
  (disparity_map_left, [optional] disparity_map_right),
  ([optional] valid_mask_left, [optional] valid_mask_right)
)

Horizontal Flips

For horizontal flips, we horizontally flip each component and then swap the order of left and right. Suppose HF is an operation that horizontally flips a single component; then:

horizontal_flips(
  (image_left, image_right),
  (disparity_map_left, disparity_map_right),
  (valid_mask_left, valid_mask_right)
) = (
  (HF(image_right), HF(image_left)),
  (HF(disparity_map_right), HF(disparity_map_left)),
  (HF(valid_mask_right), HF(valid_mask_left))
)

The new behavior in this transform is that we need to swap the left and right ordering, which did not exist in previous use cases.

Here is a code reference:

return ((img_right, img_left), (dsp_right, dsp_left), (mask_right, mask_left))
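For illustration, a minimal sketch of the full transform, assuming every component is a tensor and using `torchvision.transforms.functional.hflip`; the function name `stereo_horizontal_flip` and the argument layout are assumptions based on the structure above:

```python
import torchvision.transforms.functional as F

def stereo_horizontal_flip(images, disparities, masks):
    """Horizontally flip each component, then swap left and right."""
    img_left, img_right = images
    dsp_left, dsp_right = disparities
    mask_left, mask_right = masks
    # After flipping, the flipped right image plays the role of the
    # new left image, and vice versa.
    return (
        (F.hflip(img_right), F.hflip(img_left)),
        (F.hflip(dsp_right), F.hflip(dsp_left)),
        (F.hflip(mask_right), F.hflip(mask_left)),
    )
```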

Resize

For resizing, we resize the image normally, but for the disparity maps we need to adjust the values after the resize: multiply them by the horizontal scaling factor and then round to the nearest integer.

Remember that a disparity map stores pixel displacements, and the unit of displacement is pixels. If we resize the image, then the pixel displacements also scale by the same factor as the resize.

Note that flow in the optical flow task behaves similarly under resize.

Here is the code reference:

resized_disparities += (F.resize(dsp, self.size) * scale_x,)
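For illustration, a minimal sketch of the disparity half of the transform, assuming the disparity map is a `[1, H, W]` tensor that encodes horizontal displacement only; `resize_disparity` and its parameters are hypothetical names:

```python
import torch
import torchvision.transforms.functional as F

def resize_disparity(disparity, new_size):
    """Resize a [1, H, W] disparity map to new_size = (H', W')
    and rescale its values accordingly."""
    _, _, w = disparity.shape
    scale_x = new_size[1] / w  # horizontal scaling factor
    # Nearest-neighbour interpolation avoids blending disparity values
    # across depth discontinuities.
    resized = F.resize(disparity, list(new_size),
                       interpolation=F.InterpolationMode.NEAREST)
    # Displacements are measured in pixels, so they scale with the width;
    # round back to the nearest integer as described above.
    return torch.round(resized * scale_x)
```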

Erase / Occlusion

For erase or occlusion, we commonly apply the operation only to the right image, leaving the other components unchanged (no transformation happens on the left image, the disparity maps, or the valid masks).

Here are code references:

right_image = F.erase(right_image, x, y, h, w, v, self.inplace)

https://github.com/princeton-vl/RAFT-Stereo/blob/main/core/utils/augmentor.py#L109

https://github.com/megvii-research/CREStereo/blob/ad3a1613bdedd88b93247e5f002cb7c80799762d/dataset.py#L155
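As an illustration, a minimal sketch of the asymmetric erase, assuming tensor images; `stereo_erase` and the fixed region parameters are hypothetical (in practice the region is sampled randomly):

```python
import torch
import torchvision.transforms.functional as F

def stereo_erase(img_left, img_right, i, j, h, w):
    """Erase a rectangle from the right image only; the left image,
    disparity maps, and valid masks are left untouched."""
    # F.erase fills the region [i:i+h, j:j+w] of a tensor image with v.
    img_right = F.erase(img_right, i, j, h, w, v=torch.zeros(1), inplace=False)
    return img_left, img_right
```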

cc @vfdev-5 @datumbox @pmeier @TeodorPoncu

@datumbox (Contributor) commented:

Thanks for the overview @YosuaMichael.

It's worth noting that this is a somewhat abstract view of the stereo depth perception case. There are more nuances (float vs integer maps, signed maps, etc.), but this gives a very good picture.

I think this is great food for thought and a good discussion starter for how the new Transforms API can adapt to new use cases it wasn't designed for. @pmeier and @vfdev-5, this is not a top priority, but it is worth looking at some of the details listed here to see how we could potentially support them. I think that most of the things described here are possible, but there may be details worth looking into when we decide to bring this in as a standalone project.
