PyTorch Profiler causes memory leak #10717
I have noticed the same issue in plain PyTorch when using

```python
for epoch in range(epochs):
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        train()
        test()
```

but it is OK if you use it inside your module's `forward`:

```python
class Net(nn.Module):
    # ...
    def forward(self, x):
        with torch.autograd.profiler.profile(use_cuda=True) as prof:
            # ...
```
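
For context, the legacy autograd profiler keeps every recorded event in memory until the context manager exits, so wrapping an entire epoch accumulates events for every op in every batch. Below is a minimal sketch of one way to keep that list bounded; `train_one_batch` and `batches` are hypothetical names, not part of the example above.

```python
import torch

def train_with_bounded_profiling(train_one_batch, batches, n_profile=10):
    # Profile only the first n_profile batches so the in-memory event list
    # stays small; the rest of the epoch runs unprofiled.
    profiled, rest = batches[:n_profile], batches[n_profile:]
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for batch in profiled:
            train_one_batch(batch)
    # Summarize and release the events once the context manager has exited.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    for batch in rest:
        train_one_batch(batch)
```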
I can reproduce a memory leak in long-running profiling tasks using

```python
for epoch in range(epochs):
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        train()
        test()
```

which can be prevented by using a schedule:

```python
for epoch in range(epochs):
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
        schedule=torch.profiler.schedule(
            wait=1,
            warmup=1,
            active=2,
        ),
    ) as prof:
        train()
        test()
```
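
Worth noting: the schedule only advances when `prof.step()` is called, so the `wait`/`warmup`/`active` windows apply per step rather than per `with` block. A minimal sketch of that usage, assuming user-defined `train_loader` and `train_step` (hypothetical names):

```python
import torch

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
) as prof:
    for batch in train_loader:
        train_step(batch)
        prof.step()  # signal a step boundary so the schedule can advance
```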
I am just stabbing at the source code a bit here: if I remove the block

```python
# the default schedule requires a minimum of 5 steps to properly work: `wait=1, warmup=1, active=3`.
# otherwise, this will raise a `segmentation fault`.
if self._should_override_schedule():
    warning_cache.warn(
        "The PyTorch Profiler default schedule will be overridden as there is not enough "
        "steps to properly record traces."
    )
    self._schedule = None
    self.profiler.schedule = torch.profiler.profiler._default_schedule_fn
```

the MNIST example immediately consumes 4.6 GB of RAM, but does not seem to leak it. Note that with the profiler disabled entirely, training only uses 380 MB of RAM.
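
As a possible workaround on the Lightning side, here is a hedged sketch of passing an explicit schedule to `PyTorchProfiler` so the default-schedule override above never kicks in. It assumes the extra keyword arguments are forwarded to `torch.profiler.profile`, which is my reading of the code rather than something confirmed in this thread:

```python
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import PyTorchProfiler

profiler = PyTorchProfiler(
    # Assumed to be forwarded to torch.profiler.profile as a keyword argument.
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
)
trainer = Trainer(profiler=profiler, gpus=1)
```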
In general I find it a little bit strange that if I put a
OK, and if I move the entire block

```python
# the default schedule requires a minimum of 5 steps to properly work: `wait=1, warmup=1, active=3`.
# otherwise, this will raise a `segmentation fault`.
if self._should_override_schedule():
    warning_cache.warn(
        "The PyTorch Profiler default schedule will be overridden as there is not enough "
        "steps to properly record traces."
    )
    self._schedule = None
    self.profiler.schedule = torch.profiler.profiler._default_schedule_fn
```

out of
@nils-werner thanks for raising this issue and for the pointers. 😃
Yes, this PR fixes the issue at my end.
🐛 Bug
It seems like choosing the PyTorch profiler causes an ever-growing amount of RAM to be allocated. This even continues after training, probably while the profiler data is being processed.
After a certain number of epochs, this causes an OOM and triggers my Kernel to kill the process.
To Reproduce
To reproduce, simply enable the profiler on one of the provided examples:

```bash
cd pl_examples/basic_examples/mnist_examples
python image_classifier_5_lightning_datamodule.py --trainer.profiler=pytorch --trainer.gpus=1
```
On my machine, I run out of memory sometime mid `epoch=3` and the process gets killed.
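
For anyone trying to quantify the growth, here is a small sketch of tracking resident memory per epoch with a Lightning callback; `psutil` and the `RSSMonitor` class are my additions for illustration, not part of the example above:

```python
import os

import psutil
from pytorch_lightning import Callback

class RSSMonitor(Callback):
    """Print the process's resident set size at the end of every training epoch."""

    def on_train_epoch_end(self, trainer, pl_module):
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
        print(f"epoch={trainer.current_epoch} RSS={rss_mb:.0f} MB")
```

Passing it via `Trainer(callbacks=[RSSMonitor()], ...)` should make the per-epoch growth visible in the logs.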
Expected behavior
The memory leak does not occur
Environment
Additional context
I am aware that this might be caused by PyTorch and not Lightning, and I am currently trying to reproduce this issue in plain PyTorch. If I can reproduce it, this issue can of course be triaged to them.
cc @tchaton @carmocca @kaushikb11 @ninginthecloud