
DataLoader worker is killed in Docker #2559

Closed
karwojan opened this issue May 23, 2024 · 4 comments · Fixed by #2581
Labels
bug / fix (Something isn't working), help wanted (Extra attention is needed), v1.4.x

Comments

@karwojan

karwojan commented May 23, 2024

🐛 Bug

Under some very specific circumstances, when I use a classification metric during training inside a Docker image, my DataLoader workers are unexpectedly killed. I am not even sure whether this is a bug in torchmetrics, since very specific conditions are required to reproduce it. However, the problem first appeared after updating torchmetrics to 1.4.0 (there is no issue with torchmetrics 1.2 or 1.3), so I suspect the cause is related to the latest changes. After a longer investigation I still have no idea what the root cause is, but since I can reproduce it easily, I decided to report it.

To Reproduce

To reproduce the issue, run the following code snippet in a Docker container with a CUDA device available (CUDA drivers and the NVIDIA Container Toolkit installed):

import os

import torch
import torchmetrics
from torch.utils.data import DataLoader

# setup envs
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "55555"

# setup torch distributed
torch.distributed.init_process_group("nccl", init_method="env://")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# setup dataloaders
train_dl = DataLoader([1, 2, 3, 4, 5], batch_size=1, num_workers=1)
valid_dl = DataLoader([1, 2, 3, 4, 5], batch_size=1, num_workers=5)

# setup example metric
metric = torchmetrics.F1Score(task="multiclass", num_classes=3).cuda()

print("Iterate over train_dl")
# model.train()
for _ in train_dl:
    metric.update(
        torch.rand(1, 3, 40, 40, 40).cuda(),
        torch.randint(0, 3, (1, 40, 40, 40)).cuda()
    )

print("METRIC: ", metric.compute())

print("Iterate over valid_dl")
for _ in valid_dl:
    pass

print("Epoch end")

Example command to launch this script in Docker, assuming the above script is saved as error.py:

docker run --rm -it --gpus=all -v `pwd`:/error pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime /bin/bash -c "pip install torchmetrics==1.4.0 && python /error/error.py"

When tested with the official image pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime via the command above, the output is:

Iterate over train_dl
METRIC:  tensor(0.3338, device='cuda:0')
Iterate over valid_dl

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
error.py 36 <module>
for _ in valid_dl:

dataloader.py 628 __next__
data = self._next_data()

dataloader.py 1316 _next_data
idx, data = self._get_data()

dataloader.py 1282 _get_data
success, data = self._try_get_data()

dataloader.py 1133 _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e

RuntimeError:
DataLoader worker (pid(s) 104) exited unexpectedly

The same output has been observed for image pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime.

When tested with a custom image based on PyTorch 1.12.1+cu113 and Python 3.10 (the official PyTorch 1.12.1 image ships Python 3.7, while torchmetrics requires Python >=3.8), a CUDA error is also reported before the DataLoader workers die:

Iterate over train_dl
METRIC:  tensor(0.3327, device='cuda:0')
Iterate over valid_dl
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at ../c10/cuda/CUDACachingAllocator.cpp:1423 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f7d7263520e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x23af2 (0x7f7d9aceaaf2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7f7d9acef9a7 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x4637b8 (0x7f7dc42677b8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f7d7261c7a5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35f245 (0x7f7dc4163245 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x679b48 (0x7f7dc447db48 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f7dc447def5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: python() [0x596fce]
frame #9: python() [0x5bdb16]
frame #10: python() [0x4e3c13]
frame #11: python() [0x594cac]
<omitting python frames>
frame #13: python() [0x583769]
frame #16: python() [0x50c14e]
frame #18: python() [0x546b93]
frame #19: python() [0x583d8a]
frame #20: python() [0x5750bf]
frame #21: python() [0x500ab4]
frame #29: python() [0x509d08]
frame #37: python() [0x509d08]
frame #43: <unknown function> + 0xb45e (0x7f7ddf5ab45e in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #44: <unknown function> + 0xaae5 (0x7f7ddf5aaae5 in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #45: <unknown function> + 0x974e (0x7f7ddf5a974e in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #46: <unknown function> + 0xb6d7 (0x7f7ddf5ab6d7 in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #47: <unknown function> + 0x99ba (0x7f7ddf5a99ba in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #48: <unknown function> + 0x974e (0x7f7ddf5a974e in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #49: <unknown function> + 0xb6d7 (0x7f7ddf5ab6d7 in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #50: <unknown function> + 0x99ba (0x7f7ddf5a99ba in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #51: <unknown function> + 0x974e (0x7f7ddf5a974e in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #52: <unknown function> + 0x134f0 (0x7f7ddf5b34f0 in /opt/conda/lib/python3.10/lib-dynload/_pickle.cpython-310-x86_64-linux-gnu.so)
frame #53: python() [0x50afcf]
frame #55: python() [0x50c14e]
frame #63: python() [0x50c3d7]


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
error.py 36 <module>
for _ in valid_dl:

dataloader.py 681 __next__
data = self._next_data()

dataloader.py 1359 _next_data
idx, data = self._get_data()

dataloader.py 1325 _get_data
success, data = self._try_get_data()

dataloader.py 1176 _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e

RuntimeError:
DataLoader worker (pid(s) 139) exited unexpectedly

This error does not appear when:

  • the torchmetrics version is < 1.4.0, e.g.:
docker run --rm -it --gpus=all -v `pwd`:/error pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime /bin/bash -c "pip install torchmetrics==1.3.0 && python /error/error.py"
  • num_workers in train_dl or valid_dl is 0 (this reliably avoids the crash);
  • certain other nonzero values of num_workers are configured (e.g., 1 and 3; no clear pattern has been observed).

Expected behavior

There should be no errors and "Epoch end" should be printed when running this code.
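
For reference, a successful run should produce output roughly like the following (the metric value will differ between runs because the inputs are random):

Iterate over train_dl
METRIC:  tensor(0.3338, device='cuda:0')
Iterate over valid_dl
Epoch end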

Environment

  • Linux Ubuntu VERSION="18.04.5 LTS (Bionic Beaver)"
  • NVIDIA-SMI 525.78.01 Driver Version: 525.78.01 CUDA Version: 12.0
  • Mentioned Docker images
karwojan added the bug / fix and help wanted labels on May 23, 2024

Hi! Thanks for your contribution, great first issue!

Borda added the v1.4.x label on May 24, 2024
@SkafteNicki
Member

Hi @karwojan, thanks for reporting this issue.
I cannot really tell what is going on here. My gut feeling is that this has nothing to do with torchmetrics and is more of a torch issue, but who knows. I tried it on my own machine and could not reproduce the issue.
Could you report what happens if you run the code with the CUDA_LAUNCH_BLOCKING environment variable set, as suggested in the error message?
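
For reference, one way to set that variable is to reuse the reproduction command from the report, e.g.:

docker run --rm -it --gpus=all -v `pwd`:/error pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime /bin/bash -c "pip install torchmetrics==1.4.0 && CUDA_LAUNCH_BLOCKING=1 python /error/error.py"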

@karwojan
Author

Hi @SkafteNicki, thanks for your answer.
Have you tried running this script in Docker with the following command?

docker run --rm -it --gpus=all -v `pwd`:/error pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime /bin/bash -c "pip install torchmetrics==1.4.0 && python /error/error.py"

Without Docker, I am also unable to reproduce the issue.
When I run the script with the CUDA_LAUNCH_BLOCKING environment variable set, the result is exactly the same.

@SkafteNicki
Member

SkafteNicki commented Jun 1, 2024

Hi @karwojan,
I tried some more and was finally able to reproduce the issue.
I then ran git bisect to narrow down when the bug was introduced, with v1.3.2 marked as a good commit and v1.4.0 as a bad commit. The bisect identified commit cd7ccfc, from merging PR #2468, as the point where the bug was introduced.
I am still not sure which change in that PR is actually causing this, but I will try to narrow it down.
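
For context, this follows the standard git bisect workflow; a rough sketch (the exact per-commit test command is not stated in the thread and is assumed here to be a reinstall plus the reproduction script):

git bisect start
git bisect bad v1.4.0    # first release where the crash is observed
git bisect good v1.3.2   # last release known to work
# at each commit suggested by git bisect: reinstall torchmetrics from the checkout,
# run the reproduction script, and mark the commit accordingly
pip install -e . && python error.py
git bisect good   # or `git bisect bad` if the DataLoader workers were killed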
