Confused by dimension order in MONAI. #4872

GYDDHPY · 2022-08-09T09:08:36Z


I was reading the SWIN UNETR code and was confused by the dimension order used in the model.

For the BTCV dataset, the preprocessing used in SWIN UNETR is similar to that of UNETR, and includes:
LoadImaged, AddChanneld, Orientationd, Spacingd, ScaleIntensityRanged, CropForegroundd, RandCropByPosNegLabeld, RandFlipd, RandFlipd, RandFlipd, RandRotate90d, RandScaleIntensityd, RandShiftIntensityd, ToTensord.

The dimension order of the image after using these transforms seems to be [h w d].

However, in the PatchEmbed module of SwinTransformer, the code directly uses _, _, d, h, w = x_shape, which treats the image as if the dimension order had changed to [d h w].

MONAI/monai/networks/blocks/patchembedding.py

Lines 122 to 180 in 456caed

    
    class PatchEmbed(nn.Module):
        """
        Patch embedding block based on: "Liu et al.,
        Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
        <https://arxiv.org/abs/2103.14030>"
        https://github.com/microsoft/Swin-Transformer
        Unlike ViT patch embedding block: (1) input is padded to satisfy window size requirements (2) normalized if
        specified (3) position embedding is not used.
        Example::
            >>> from monai.networks.blocks import PatchEmbed
            >>> PatchEmbed(patch_size=2, in_chans=1, embed_dim=48, norm_layer=nn.LayerNorm, spatial_dims=3)
        """

        def __init__(
            self,
            patch_size: Union[Sequence[int], int] = 2,
            in_chans: int = 1,
            embed_dim: int = 48,
            norm_layer: Type[LayerNorm] = nn.LayerNorm,
            spatial_dims: int = 3,
        ) -> None:
            """
            Args:
                patch_size: dimension of patch size.
                in_chans: dimension of input channels.
                embed_dim: number of linear projection output channels.
                norm_layer: normalization layer.
                spatial_dims: spatial dimension.
            """
            super().__init__()
            if not (spatial_dims == 2 or spatial_dims == 3):
                raise ValueError("spatial dimension should be 2 or 3.")
            patch_size = ensure_tuple_rep(patch_size, spatial_dims)
            self.patch_size = patch_size
            self.embed_dim = embed_dim
            self.proj = Conv[Conv.CONV, spatial_dims](
                in_channels=in_chans, out_channels=embed_dim, kernel_size=patch_size, stride=patch_size
            )
            if norm_layer is not None:
                self.norm = norm_layer(embed_dim)
            else:
                self.norm = None

        def forward(self, x):
            x_shape = x.size()
            if len(x_shape) == 5:
                _, _, d, h, w = x_shape
                if w % self.patch_size[2] != 0:
                    x = F.pad(x, (0, self.patch_size[2] - w % self.patch_size[2]))
                if h % self.patch_size[1] != 0:
                    x = F.pad(x, (0, 0, 0, self.patch_size[1] - h % self.patch_size[1]))
                if d % self.patch_size[0] != 0:
                    x = F.pad(x, (0, 0, 0, 0, 0, self.patch_size[0] - d % self.patch_size[0]))

This confused me a lot. Is there a step that changes the dimension order that I didn't notice, or is something else going on?
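For reference, the padding arithmetic in the excerpt above can be traced in plain Python without PyTorch. This is only a sketch: the helper names `padded_shape` and `end_pad_tuple` are hypothetical and not part of MONAI. It relies on the fact that `F.pad` reads its tuple as (before, after) pairs starting from the last axis, so the names d, h, w are tied to axis positions (first, second, third spatial axis) rather than to any anatomical direction.

```python
def padded_shape(shape, patch_size):
    """Pad each spatial axis up to the next multiple of its patch size.

    Mirrors the checks in the excerpt: an axis is padded only when
    `dim % patch != 0`, and then by `patch - dim % patch`.
    """
    return tuple(
        dim if dim % p == 0 else dim + (p - dim % p)
        for dim, p in zip(shape, patch_size)
    )


def end_pad_tuple(axis_from_last, amount):
    """Build an F.pad-style tuple that pads only the end of one axis.

    axis_from_last=0 is the last axis (called w in the excerpt),
    1 is the second to last (h), 2 is the third to last (d).
    """
    return (0, 0) * axis_from_last + (0, amount)


# The first spatial axis pairs with patch_size[0], the last with patch_size[2],
# whatever those axes mean physically after preprocessing.
example = padded_shape((5, 6, 7), (2, 2, 2))
w_pad = end_pad_tuple(0, 1)
d_pad = end_pad_tuple(2, 1)
```

Here `w_pad` has the same form as the first `F.pad` call in the excerpt and `d_pad` the same form as the third.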

Answered by wyli


wyli · 2022-08-09T10:18:54Z

Collaborator

In my understanding, as long as the model is used consistently in training and inference, there's no need to distinguish the first/second/third spatial dimensions; these d, h, w are just variable names. The only problem I can see is if, after some training, the model is used together with a different preprocessing pipeline built outside of MONAI. In that case the pipeline should be created carefully so that it consistently reproduces the training one.
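One way to see why the names are interchangeable is a small, idealized PyTorch check. It is a sketch under assumptions (random weights stand in for trained ones, and the tensor sizes are arbitrary), not a statement about how any MONAI model trains. Swapping two spatial axes of the input and the same two axes of a convolution kernel gives the same output with those axes swapped, while swapping only the input does not.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy input (batch, channel, three spatial axes) and a random 3x3x3 kernel.
x = torch.randn(1, 1, 8, 10, 12)
w = torch.randn(4, 1, 3, 3, 3)

# Swap the first and second spatial axes.
perm = (0, 1, 3, 2, 4)

y = F.conv3d(x, w)

# Consistent case: input and weights both see the swapped axis order.
y_consistent = F.conv3d(x.permute(*perm).contiguous(), w.permute(*perm).contiguous())
same = torch.allclose(y.permute(*perm), y_consistent, atol=1e-4)

# Inconsistent case: the input is swapped but the weights are not,
# which is analogous to running inference with a different axis convention.
y_inconsistent = F.conv3d(x.permute(*perm).contiguous(), w)
mismatch_same = torch.allclose(y.permute(*perm), y_inconsistent, atol=1e-4)
```

With these random tensors, `same` should be true and `mismatch_same` false, which matches the point above: the axis labels do not matter, but the axis order at inference has to match the one used in training.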

3 replies

GYDDHPY Aug 9, 2022
Author

Thank you for your quick reply. I'm trying to understand your answer.

The preprocessing is designed for a specific problem; for example, the patch size of each dimension is set for a particular axis. But the later variable names, such as d h w, are not required to be strictly consistent with the image axes, because the neural network will deal with such a design. All we need to be careful about is keeping the training and inference procedures consistent.

Is my understanding correct?

By the way, does MONAI also have this in mind during the design process?

wyli Aug 9, 2022
Collaborator

Correct, and yes: the metadata dictionary and, more recently, MetaTensor try to capture the additional information associated with the image arrays, such as spacing and orientation. Preprocessing transforms such as Spacing and Orientation then normalize the input data in a consistent manner. If these preprocessing steps are used for training, the model weights become conditioned on them. One exception is anisotropic models (with 3D conv kernels such as 3x3x1), for example

MONAI/monai/networks/nets/ahnet.py

Line 50 in 456caed

kernel_size=(3, 3, 1)[-spatial_dims:],

It typically assumes 2D thick slices as input, where the first two spatial dimensions have high resolution and the third dimension has large spacing.
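The slice expression quoted from ahnet.py can be evaluated in plain Python to see why axis order matters for such a kernel. The helper names `anisotropic_kernel` and `valid_conv_shape` below are hypothetical, and the output-size formula assumes stride 1 with no padding, purely for illustration.

```python
def anisotropic_kernel(spatial_dims):
    """Evaluate the quoted expression for a given number of spatial dims."""
    return (3, 3, 1)[-spatial_dims:]


def valid_conv_shape(shape, kernel):
    """Output size of a stride-1, unpadded convolution along each axis."""
    return tuple(dim - k + 1 for dim, k in zip(shape, kernel))


kernel_3d = anisotropic_kernel(3)  # (3, 3, 1)
kernel_2d = anisotropic_kernel(2)  # (3, 1)

# The kernel mixes neighbours along the first two axes only; the third
# (thick-slice) axis keeps its size because the kernel extent there is 1.
out_shape = valid_conv_shape((32, 32, 8), kernel_3d)
```

Because the kernel treats the third axis differently, feeding such a model a volume whose thick-slice axis comes first would no longer match the assumption the kernel encodes.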

GYDDHPY Aug 9, 2022
Author

Thank you for your quick reply again.

Your answer helps me better understand the design of the MONAI project.
