[BUG] corrupted dataset due to simultaneous downloading by all ranks. #3065

Open
LamForest opened this issue Sep 29, 2024 · 0 comments · May be fixed by #3066

Add Link

https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html

Describe the bug

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 9912422/9912422 [00:03<00:00, 3078874.05it/s]

  5%|█████▎                                                                                                    | 491520/9912422 [00:01<00:22, 417952.41it/s]Traceback (most recent call last):
  File "fsdp_mnist.py", line 173, in <module>
    mp.spawn(fsdp_main,
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/ssd1/gaotianlin/baidu/hac-aiacc/Megatron/old_scripts/fsdp/fsdp_mnist.py", line 94, in fsdp_main
    dataset1 = datasets.MNIST('./data', train=True, download=True,
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/mnist.py", line 99, in __init__
    self.download()
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/mnist.py", line 187, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/utils.py", line 434, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/utils.py", line 155, in download_url
    raise RuntimeError("File not found or corrupted.")
RuntimeError: File not found or corrupted.

/root/miniconda3/envs/old_mega/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
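
One possible workaround, sketched under assumptions and not necessarily what the linked PR does: let only rank 0 call the dataset constructor with `download=True`, make every other rank wait at a barrier, and then have them load with `download=False`. The helper name `load_with_rank0_download` and its `load_fn` / `barrier_fn` hooks are hypothetical; the thread-based `simulate` function only stands in for real ranks so the ordering can be checked without GPUs.

```python
import threading


def load_with_rank0_download(rank, load_fn, barrier_fn):
    """Hypothetical helper: rank 0 downloads, the other ranks wait, then read.

    load_fn(download: bool) builds the dataset.
    barrier_fn() blocks until every rank has reached it.
    """
    if rank == 0:
        dataset = load_fn(True)   # only rank 0 writes into the data directory
        barrier_fn()              # release the waiting ranks
    else:
        barrier_fn()              # wait until rank 0 has finished downloading
        dataset = load_fn(False)  # files already exist, so no download
    return dataset


def simulate(world_size=4):
    """Stand-in for a multi-rank job: threads play ranks, Barrier plays dist.barrier."""
    barrier = threading.Barrier(world_size)
    lock = threading.Lock()
    events = []                   # order in which ranks touched the dataset
    results = [None] * world_size

    def fake_load(download):
        with lock:
            events.append("download" if download else "read")
        return "dataset"

    def worker(rank):
        results[rank] = load_with_rank0_download(rank, fake_load, barrier.wait)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events, results
```

In `fsdp_mnist.py` this would presumably be called after the process group is initialized, with `load_fn = lambda download: datasets.MNIST('./data', train=True, download=download, transform=transform)` and `barrier_fn = dist.barrier`. This assumes all ranks see the same `./data` directory; on multiple nodes with node-local disks, gating on the local rank instead would likely be needed.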

Describe your environment

...

@LamForest LamForest added the bug label Sep 29, 2024
@LamForest LamForest linked a pull request Sep 29, 2024 that will close this issue