Stereo Depth Perception Transforms #6495

Open

YosuaMichael opened this issue Aug 25, 2022 · 1 comment

@YosuaMichael (Contributor) commented Aug 25, 2022:

We are currently revamping the transforms API for the torchvision library. During this revamp, we are considering the components and behavior needed for new use cases such as detection, segmentation, and video classification. In this document, we want to raise awareness of several quirks of depth perception transforms that behave differently from the other use cases.

Brief Introduction to Stereo Depth Perception

In the stereo depth perception problem, we are given two images, left and right, and for each pixel in the left image we want to estimate the horizontal displacement of the corresponding pixel in the right image. The images are taken at the same time with two different cameras that share the same upright orientation, and the baseline between the cameras must be parallel to the horizontal axis / ground. The desired output is called a disparity map: a 2d integer array with the same size as the image, where each value indicates how many pixels the pixel at that position is displaced from the left image to the right image (the unit of displacement is pixels).

Stereo depth perception datasets sometimes also provide a valid mask. This is a 2d boolean array with the same size as the disparity map; it indicates whether each value in the disparity map is valid. In practice, we usually exclude the invalid disparity values from the loss calculation during training or validation, for example as shown below.
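As an illustration, here is a minimal sketch of how a valid mask is typically used to exclude invalid disparities from an L1 loss. The names `pred_disparity`, `gt_disparity`, `valid_mask`, and `masked_l1_loss` are hypothetical, not from any specific codebase:

```python
import torch

def masked_l1_loss(pred_disparity, gt_disparity, valid_mask):
    # Only pixels where the mask is True contribute to the loss;
    # invalid disparity values are dropped before averaging.
    diff = (pred_disparity - gt_disparity).abs()
    return diff[valid_mask].mean()
```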

Components

  1. Images:
    A pair of left and right images as input.
  2. Disparity map:
    A 2d integer array with the same size as the image, indicating the displacement of pixels between the image pair. Some datasets only give one disparity map, which usually indicates the displacement from the left to the right image, while other datasets provide a pair of left and right disparity maps.
  3. Valid mask:
    A 2d boolean array with the same size as the disparity map, indicating whether each disparity value is valid and should be considered in the loss calculation.

In general, a stereo depth perception dataset will have:

(
  (image_left, image_right),
  (disparity_map_left, [optional] disparity_map_right),
  ([optional] valid_mask_left, [optional] valid_mask_right)
)

Horizontal Flips

For horizontal flips, we horizontally flip each component and then swap the order of left and right. Suppose HF is an operation that horizontally flips a single component; then:

horizontal_flips(
  (image_left, image_right),
  (disparity_map_left, disparity_map_right),
  (valid_mask_left, valid_mask_right)
) = (
  (HF(image_right), HF(image_left)),
  (HF(disparity_map_right), HF(disparity_map_left)),
  (HF(valid_mask_right), HF(valid_mask_left))
)

The new behavior in this transform is that we need to swap the left and right ordering, which did not exist in previous use cases.

Here is a code reference:

return ((img_right, img_left), (dsp_right, dsp_left), (mask_right, mask_left))
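For illustration, a minimal sketch of the full transform, assuming every component is a tensor and using `torchvision.transforms.functional.hflip`; the function name `stereo_horizontal_flip` and the argument layout are assumptions based on the structure above:

```python
import torchvision.transforms.functional as F

def stereo_horizontal_flip(images, disparities, masks):
    """Horizontally flip each component, then swap left and right."""
    img_left, img_right = images
    dsp_left, dsp_right = disparities
    mask_left, mask_right = masks
    # After flipping, the flipped right image plays the role of the
    # new left image, and vice versa.
    return (
        (F.hflip(img_right), F.hflip(img_left)),
        (F.hflip(dsp_right), F.hflip(dsp_left)),
        (F.hflip(mask_right), F.hflip(mask_left)),
    )
```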

Resize

For resizing, we resize the image normally, but for the disparity maps we need to adjust the values after the resize: multiply them by the horizontal scaling factor and then round to the nearest integer.

Remember that a disparity map stores pixel displacements, and the unit of displacement is pixels. If we resize the image, then the pixel displacements also scale by the same factor as the resize.

Note that flow in the optical flow task behaves similarly under resize.

Here is the code reference:

resized_disparities += (F.resize(dsp, self.size) * scale_x,)
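For illustration, a minimal sketch of the disparity half of the transform, assuming the disparity map is a `[1, H, W]` tensor that encodes horizontal displacement only; `resize_disparity` and its parameters are hypothetical names:

```python
import torch
import torchvision.transforms.functional as F

def resize_disparity(disparity, new_size):
    """Resize a [1, H, W] disparity map to new_size = (H', W')
    and rescale its values accordingly."""
    _, _, w = disparity.shape
    scale_x = new_size[1] / w  # horizontal scaling factor
    # Nearest-neighbour interpolation avoids blending disparity values
    # across depth discontinuities.
    resized = F.resize(disparity, list(new_size),
                       interpolation=F.InterpolationMode.NEAREST)
    # Displacements are measured in pixels, so they scale with the width;
    # round back to the nearest integer as described above.
    return torch.round(resized * scale_x)
```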

Erase / Occlusion

For erase or occlusion, we commonly apply the operation only to the right image, leaving the other components unchanged (no transformation happens on the left image, the disparity maps, or the valid masks).

Here are code references:

right_image = F.erase(right_image, x, y, h, w, v, self.inplace)

https://github.com/princeton-vl/RAFT-Stereo/blob/main/core/utils/augmentor.py#L109

https://github.com/megvii-research/CREStereo/blob/ad3a1613bdedd88b93247e5f002cb7c80799762d/dataset.py#L155
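As an illustration, a minimal sketch of the asymmetric erase, assuming tensor images; `stereo_erase` and the fixed region parameters are hypothetical (in practice the region is sampled randomly):

```python
import torch
import torchvision.transforms.functional as F

def stereo_erase(img_left, img_right, i, j, h, w):
    """Erase a rectangle from the right image only; the left image,
    disparity maps, and valid masks are left untouched."""
    # F.erase fills the region [i:i+h, j:j+w] of a tensor image with v.
    img_right = F.erase(img_right, i, j, h, w, v=torch.zeros(1), inplace=False)
    return img_left, img_right
```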

cc @vfdev-5 @datumbox @pmeier @TeodorPoncu

@datumbox (Contributor) commented:

Thanks for the overview @YosuaMichael.

It's worth noting that this is a somewhat abstract view of the stereo depth perception case. There are more nuances (float vs integer maps, signed maps, etc.), but this gives a very good picture.

I think this is great food for thought and a good discussion starter for how the new Transforms API can adapt to new use cases it wasn't designed for. @pmeier and @vfdev-5, this is not a top priority, but it is worth looking at some of the details listed here to see how we could potentially support them. I think that most of the things described here are possible, but there may be details worth looking into when we decide to bring this in as a standalone project.
