Pytorch Profiler causes memory leak #10717

Closed
nils-werner opened this issue Nov 23, 2021 · 7 comments · Fixed by #10837
Labels: bug, priority: 0, profiler

Comments


nils-werner commented Nov 23, 2021

🐛 Bug

It seems like choosing the PyTorch profiler causes an ever-growing amount of RAM to be allocated. This continues even after training, probably while the profiler data is being processed.

After a certain number of epochs this causes an OOM, and my kernel kills the process.

To Reproduce

To reproduce, simply enable the profiler on one of the provided examples:

cd pl_examples/basic_examples/mnist_examples
python image_classifier_5_lightning_datamodule.py --trainer.profiler=pytorch --trainer.gpus=1

On my machine, sometime in the middle of epoch 3, I run out of memory and the process gets killed.

Expected behavior

The memory leak does not occur

Environment

* CUDA:
        - GPU:
                - NVIDIA GeForce GTX 1060 6GB
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.21.4
        - pyTorch_debug:     False
        - pyTorch_version:   1.10.0+cu102
        - pytorch-lightning: 1.6.0dev
        - tqdm:              4.62.3
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         
        - python:            3.8.12
        - version:           #1 ZEN SMP PREEMPT Thu, 18 Nov 2021 22:23:53 +0000

Additional context

I am aware that this might be caused by PyTorch rather than Lightning, and I am currently trying to reproduce the issue in plain PyTorch. If I can reproduce it, this issue can of course be triaged to them.

cc @tchaton @carmocca @kaushikb11 @ninginthecloud

nils-werner added the bug label on Nov 23, 2021
nils-werner (Author) commented Nov 24, 2021

I have noticed the same issue in plain PyTorch when using torch.autograd.profiler.profile() outside of an nn.Module, i.e. when the profiled region also contains the data loading:

for epoch in range(epochs):
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        train()
        test()

but it is OK if you use it inside your nn.Module, i.e. when you are only profiling the math ops:

class Net(nn.Module):
    # ...
    def forward(self, x):
        with torch.autograd.profiler.profile(use_cuda=True) as prof:
            # ...
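
For what it's worth, the growth is easy to observe from outside the profiler by printing the process RSS after every epoch. A rough sketch (psutil is my addition here, and epochs/train()/test() are the same placeholders as above):

import os

import psutil
import torch

process = psutil.Process(os.getpid())

for epoch in range(epochs):
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        train()
        test()
    # with the leak, this number grows every epoch instead of staying roughly flat
    print(f"epoch {epoch}: rss = {process.memory_info().rss / 2**20:.0f} MiB")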

nils-werner (Author) commented Nov 24, 2021

I can reproduce a memory leak in long-running profiling tasks using torch.profiler.profile(), too:

for epoch in range(epochs):
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        train()
        test()

which can be prevented by using a schedule:

for epoch in range(epochs):
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
        schedule=torch.profiler.schedule(
            wait=1,
            warmup=1,
            active=2
        ),
    ) as prof:
        train()
        test()
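
For completeness, the scheduled profiler is normally driven by calling prof.step() once per iteration. The inner per-batch loop, train_loader and train_step() below are assumptions of mine, not part of the snippet above:

for epoch in range(epochs):
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
    ) as prof:
        for batch in train_loader:   # hypothetical DataLoader
            train_step(batch)        # hypothetical per-batch work
            prof.step()              # advances the wait -> warmup -> active schedule

As far as I can tell, without step() calls a wait=1 schedule keeps the profiler in its wait phase, so almost nothing is recorded, which would also explain why the version above stops accumulating memory.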

nils-werner (Author) commented Nov 24, 2021

I am just stabbing at the source code a bit here: if I remove the block at profiler/pytorch.py:415-421

# the default schedule requires a minimum of 5 steps to properly work: `wait=1, warmup=1, active=3`.
# otherwise, this will raise a `segmentation fault`.
if self._should_override_schedule():
    warning_cache.warn(
        "The PyTorch Profiler default schedule will be overridden as there is not enough "
        "steps to properly record traces."
    )
    self._schedule = None
    self.profiler.schedule = torch.profiler.profiler._default_schedule_fn

The MNIST example then immediately consumes 4.6 GB of RAM, but does not seem to leak it.

Note that with the profiler disabled entirely, training uses only 380 MB of RAM.

tchaton added the priority: 0 label on Nov 24, 2021
nils-werner (Author) commented Nov 24, 2021

In general I find it a little strange that self._schedule is changed in PyTorchProfiler.stop(). start() and stop() are called repeatedly during training (once per batch?), which means the schedule changes after the first batch.

If I put a print(self._schedule) directly at the beginning of stop(), I see the following output:

Training: 0it [00:00, ?it/s]
<pytorch_lightning.profiler.pytorch.ScheduleWrapper object at 0x7ff737d75310>
Epoch 0:   0%|                               | 0/1875 [00:00<?, ?it/s]
<pytorch_lightning.profiler.pytorch.ScheduleWrapper object at 0x7ff737d75310>
<pytorch_lightning.profiler.pytorch.ScheduleWrapper object at 0x7ff737d75310>
<pytorch_lightning.profiler.pytorch.ScheduleWrapper object at 0x7ff737d75310>
# ...
/home/nils/Arbeit/repro/pytorch-lightning/pytorch_lightning/profiler/pytorch.py:417: UserWarning: The PyTorch Profiler default schedule will be overridden as there is not enough steps to properly record traces.
  warning_cache.warn(
None
Epoch 0:   0%|                               | 1/1875 [00:00<01:53, 16.53it/s, loss=2.33, v_num=31]
None
None
# ...

nils-werner (Author) commented Nov 24, 2021

OK, and if I move the entire block

# the default schedule requires a minimum of 5 steps to properly work: `wait=1, warmup=1, active=3`.
# otherwise, this will raise a `segmentation fault`.
if self._should_override_schedule():
    warning_cache.warn(
        "The PyTorch Profiler default schedule will be overridden as there is not enough "
        "steps to properly record traces."
    )
    self._schedule = None
    self.profiler.schedule = torch.profiler.profiler._default_schedule_fn

out of stop() and to the end of _init_kineto(), the schedule remains constant during training and the leak is gone. Note that I am still just poking at the source code here and am not sure whether _init_kineto() is indeed the correct place for this block.
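
Roughly, the placement I am experimenting with looks like this (just a sketch of where the quoted block ends up, not the actual patch):

def _init_kineto(self, profiler_kwargs: Any) -> None:
    # ... existing body unchanged ...

    # moved here from stop(), so it runs once rather than on every stop() call
    if self._should_override_schedule():
        warning_cache.warn(
            "The PyTorch Profiler default schedule will be overridden as there is not enough "
            "steps to properly record traces."
        )
        self._schedule = None
        self.profiler.schedule = torch.profiler.profiler._default_schedule_fn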

rohitgr7 (Contributor) commented

@nils-werner thanks for raising this issue and for the pointers. 😃
Can you try installing the PR branch and check whether the issue still exists?

pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git@fix/pt_prof_leak

nils-werner (Author) commented

Yes, this PR fixes the issue on my end.
