RandCropByPosNegLabeld can lead to memory overflow #8348

@nkaenzig

Describe the bug
The RandCropByPosNegLabeld transform can exhaust memory, which causes the main process to be killed.
This issue was previously reported in #1574, but at the time the root cause was neither identified nor fixed.

I did some debugging and found that the issue stems from the optional_import call, which is triggered by the floor_divide function in monai.transforms.utils_pytorch_numpy_unification. For a reason I have not fully pinned down, the failed imports cause something to accumulate in memory, as briefly described in PR #8347.

The call stack leading to floor_divide is as follows: RandCropByPosNegLabel.__call__() -> RandCropByPosNegLabel.randomize() -> generate_pos_neg_label_crop_centers -> unravel_index -> floor_divide -> is_module_ver_at_least -> version_leq -> optional_import
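
To illustrate why this chain is hit so often, here is a minimal pure-Python sketch of index unraveling. It is not MONAI's implementation: `unravel_index_sketch` is a hypothetical name, and the sketch assumes one floor division per spatial dimension for each sampled crop center.

```python
def unravel_index_sketch(flat_index, shape):
    """Convert a flat index into per-dimension coordinates (row-major order).

    Returns the coordinates and the number of floor divisions performed,
    so the cost per crop center is visible.
    """
    coords = []
    divisions = 0
    for dim in reversed(shape):
        coords.append(flat_index % dim)   # coordinate along this dimension
        flat_index = flat_index // dim    # the floor division step
        divisions += 1
    return tuple(reversed(coords)), divisions


# Spatial shape taken from the reproduction script in this issue.
shape = (102, 294, 340)
coords, divisions = unravel_index_sketch(1234567, shape)
print(coords, divisions)
```

Under that assumption, a 3D volume with num_samples=128 would need 3 * 128 = 384 floor divisions per transform call, and each one would pass through the version check.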

While #8347 fixes this issue "superficially", I would recommend removing the is_module_ver_at_least check from floor_divide altogether. Not only has it caused this hard-to-debug OOM issue, it also slows down data pipelines that contain transforms using floor_divide, because is_module_ver_at_least may be called many thousands of times in those pipelines, each time executing potentially slow import statements.
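
If the check cannot be removed outright, one alternative is to evaluate it once and reuse the result. The sketch below shows that idea with `functools.lru_cache`. All names in it (`slow_version_check`, `cached_version_check`, `floor_divide_sketch`) are hypothetical stand-ins rather than MONAI functions, and plain Python integers stand in for tensors.

```python
import functools
import math

call_count = 0  # counts how often the expensive check really runs


def slow_version_check(module_version: str, minimum: tuple) -> bool:
    """Hypothetical stand-in for an expensive, import-based version check."""
    global call_count
    call_count += 1
    major_minor = tuple(int(p) for p in module_version.split("+")[0].split(".")[:2])
    return major_minor >= minimum


@functools.lru_cache(maxsize=None)
def cached_version_check(module_version: str, minimum: tuple) -> bool:
    # Arguments are hashable, so the result is computed once per distinct pair.
    return slow_version_check(module_version, minimum)


def floor_divide_sketch(a: int, b: int, module_version: str = "2.5.1") -> int:
    if cached_version_check(module_version, (1, 8)):
        return a // b              # "new API" branch
    return math.floor(a / b)       # "legacy" branch, kept only for illustration


results = [floor_divide_sketch(i, 3) for i in range(10_000)]
print(call_count)
```

With the cache in place, the expensive check runs once for the 10,000 calls instead of once per call. Dropping the check entirely, as suggested above, would avoid even the cache lookup.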

To Reproduce
Running the following Python script eventually leads to the main process being killed. On my machine this happens after around 50 iterations.

import torch
from torch.utils import data
from tqdm import tqdm
from monai.data import meta_obj
from monai.transforms.croppad.array import RandCropByPosNegLabel 

NUM_SAMPLES = 128
TRACK_META = False
RANDOM_INPUTS = False
N_EPOCHS = 10000000

if not TRACK_META:
    meta_obj.set_track_meta(False)

class DummyDataset(data.Dataset):
    def __init__(self, num_samples: int = 10000):
        self.num_samples = num_samples
        self.transform = RandCropByPosNegLabel(spatial_size=[96, 96, 96], num_samples=NUM_SAMPLES)
        generator = torch.Generator().manual_seed(0)
        self.data = torch.rand((1, 102, 294, 340), generator=generator)
        self.mask = torch.randint(0, 4, (1, 102, 294, 340), generator=generator)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.transform(img=self.data, label=self.mask)

dataset = DummyDataset()
dataloader = data.DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

for epoch in range(N_EPOCHS):
    for i, item in tqdm(enumerate(dataloader)):
        pass
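
To watch the accumulation while the script runs, a small helper like the one below could be called inside the loop. This is a sketch using only the standard library: `peak_rss_mb` is a hypothetical helper name, it works on Unix-like systems only, and it reports the peak resident set size rather than the current one, so steady growth in the printed value is the signal to look for.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return the peak resident set size of this process in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


baseline = peak_rss_mb()
buffers = []
for step in range(5):
    # Simulate a leak by holding on to 10 MB per step.
    buffers.append(b"x" * (10 * 1024 * 1024))
    print(f"step {step}: peak RSS {peak_rss_mb():.1f} MB")
growth = peak_rss_mb() - baseline
```

In the reproduction script, the same call could be placed in the batch loop, for example printing the value every few batches.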

Expected behavior
No memory accumulation / process should not be killed :)

Environment


================================
Printing MONAI config...
================================
MONAI version: 1.4.0
Numpy version: 1.26.4
Pytorch version: 2.5.1
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 46a5272196a6c2590ca2589029eed8e4d56ff008
MONAI __file__: /Users/<username>/workspace/kaiko-eng-worktrees/kaiko-eng/dist/export/python/virtualenvs/python-default/3.11.10/lib/python3.11/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
ITK version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 4.0.2
scikit-image version: 0.24.0
scipy version: 1.14.1
Pillow version: 10.4.0
Tensorboard version: 2.18.0
gdown version: 5.2.0
TorchVision version: 0.20.1
tqdm version: 4.66.6
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 6.1.0
pandas version: 2.1.4
einops version: 0.8.0
transformers version: 4.48.1
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: NOT INSTALLED or UNKNOWN VERSION.
clearml version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies


================================
Printing system config...
================================
System: Darwin
Mac version: 15.3.1
Platform: macOS-15.3.1-arm64-arm-64bit
Processor: arm
Machine: arm64
Python version: 3.11.10
Process name: python3.11
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 10
Num logical CPUs: 10
Num usable CPUs: UNKNOWN for given OS
CPU usage (%): [56.7, 56.3, 73.9, 70.3, 62.9, 59.7, 66.2, 42.6, 32.9, 27.6]
CPU freq. (MHz): 3228
Load avg. in last 1, 5, 15 mins (%): [51.3, 50.3, 49.2]
Disk usage (%): 76.5
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 64.0
Available memory (GB): 51.1
Used memory (GB): 10.6

================================
Printing GPU config...
================================
Num GPUs: 0
Has CUDA: False
cuDNN enabled: False
NVIDIA_TF32_OVERRIDE: None
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE: None

Labels: bug (Something isn't working)
