training hangs with lightning ddp and cloud dir? #408

rxqy · 2024-11-01T03:47:57Z

🐛 Bug

Hi, we are using Lightning with litdata on our local machine, with the data stored on AWS S3. Training hangs at random during the very first iterations when DDP is combined with a remote cloud directory.

I tried several configurations, but I'm not sure what to check next:

GPUs | Strategy | Data location | Result
1    | No DDP   | local SSD     | OK
1    | No DDP   | remote (S3)   | OK
8    | DDP      | local SSD     | OK
8    | DDP      | remote (S3)   | Stuck
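
One thing that might help narrow down the stuck case is a periodic Python stack dump from the hung process. The following is a minimal sketch using only the standard-library `faulthandler` module; the helper names `enable_hang_dump` and `disable_hang_dump` and the default timeout are hypothetical, not part of Lightning or litdata.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=300.0, stream=None):
    """Write the stack of every thread in this process to `stream`
    each time `timeout_s` seconds pass, until cancelled."""
    if stream is None:
        stream = sys.stderr
    # repeat=True keeps dumping every timeout_s seconds, so a hang that
    # starts late in the run is still captured.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disable_hang_dump():
    """Stop the periodic dumps started by enable_hang_dump()."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump()` near the top of train.py would only cover the process that calls it; dataloader worker processes would need their own call, which this sketch does not attempt.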

To Reproduce

I'm following the exact steps of the ImageNet demo, with a trainer I wrote myself (code below).
Running python train.py with different CUDA_VISIBLE_DEVICES settings is enough to reproduce it.
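
For reference, the different runs can be scripted roughly as follows. This is only a sketch: `launch_env` and `run_train` are hypothetical helper names, and it assumes train.py sits in the current directory.

```python
import os
import subprocess
import sys


def launch_env(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set
    to the given GPU indices (e.g. [0] or range(8))."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_train(gpu_ids, script="train.py"):
    """Run the training script with only `gpu_ids` visible and
    return its exit code."""
    result = subprocess.run([sys.executable, script], env=launch_env(gpu_ids))
    return result.returncode
```

With this, `run_train([0])` corresponds to the single-GPU rows and `run_train(range(8))` to the 8-GPU rows.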

Code sample

# train.py
import numpy as np
import lightning as L
import torch, torch.nn as nn, torch.utils.data as data, torchvision as tv, torch.nn.functional as F

class LitAutoEncoder(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(nn.Linear(32, 128))

    def training_step(self, batch, batch_idx):
        loss = self.decoder(batch).mean()
        print(self.trainer.local_rank, loss)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


from lightning.data import StreamingDataset, StreamingDataLoader

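USE_S3 = True  # set to False to read the local "val" directory instead

class ImageNetStreaming(StreamingDataset):
    def __init__(self):
        # Remote S3 prefix (redacted) or the local copy of the dataset
        input_dir = "s3:// xxxxx" if USE_S3 else "val"
        super().__init__(
            input_dir=input_dir,
            max_cache_size="200GB",
            shuffle=True,
        )

    def __getitem__(self, idx):
        # Fetch the real sample so the chunk is actually read, then discard it
        # and return a constant, so the model sees a fixed dummy input.
        data = super().__getitem__(idx)
        return np.float32(123.)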

dataset = ImageNetStreaming()
dataloader = StreamingDataLoader(
    dataset,
    batch_size = 32,
    num_workers = 2,
    pin_memory = True,
    shuffle = True,
    drop_last = True
)

autoencoder = LitAutoEncoder()
trainer = L.Trainer()
trainer.fit(autoencoder, dataloader)

Expected behavior

Training should finish

Additional context

Due to some regulations here, we cannot put our data or training scripts on Lightning Studio. I'm not sure whether something is wrong with our S3 bucket or with our network configuration.
One thing I noticed: even when training gets stuck at some iteration (<50), we still observe high network throughput on our machine (around 100mb/s), but the local chunk directory (~/.lightning/chunks) stops growing.
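
The chunk-directory observation can be checked with a small poller like the one below. It is a minimal sketch using only the standard library; `dir_size_bytes` and `watch_dir` are hypothetical helper names, and the sampling interval is arbitrary.

```python
import os
import time


def dir_size_bytes(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file was removed between listing and stat; skip it.
                pass
    return total


def watch_dir(path, interval_s=5.0, samples=12):
    """Sample the directory size `samples` times, `interval_s` apart,
    printing each reading and returning the list of sizes."""
    sizes = []
    for _ in range(samples):
        size = dir_size_bytes(path)
        sizes.append(size)
        print(f"{time.strftime('%H:%M:%S')} {path}: {size} bytes")
        time.sleep(interval_s)
    return sizes
```

Running `watch_dir(os.path.expanduser("~/.lightning/chunks"))` in a separate terminal while training is stuck would show whether the size really stays flat.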

Current environment

CUDA:
- GPU:
  - Tesla V100-SXM2-32GB
  - Tesla V100-SXM2-32GB
- available: True
- version: 12.1
Lightning:
- lightning: 2.3.0
- lightning-utilities: 0.11.1
- pytorch-lightning: 2.2.1
- torch: 2.2.1
- torchaudio: 2.2.1
- torchmetrics: 1.3.2
- torchvision: 0.17.1
Packages:
- absl-py: 2.1.0
- accelerate: 0.30.1
- aiofiles: 23.2.1
- aiohttp: 3.9.3
- aiosignal: 1.3.1
- angle-emb: 0.3.10
- annotated-types: 0.7.0
- anyio: 4.4.0
- async-timeout: 4.0.3
- attrs: 23.2.0
- auto-gptq: 0.7.1
- av: 12.3.0
- awscli: 1.32.70
- backports-datetime-fromisoformat: 2.0.1
- bitsandbytes: 0.43.1
- blessed: 1.20.0
- blinker: 1.7.0
- boltons: 24.0.0
- boto3: 1.34.143
- botocore: 1.34.143
- braceexpand: 0.1.7
- brotli: 1.0.9
- certifi: 2024.2.2
- charset-normalizer: 2.0.4
- click: 8.1.7
- colorama: 0.4.4
- coloredlogs: 15.0.1
- contourpy: 1.2.1
- cos-python-sdk-v5: 1.9.30
- crcmod: 1.7
- cycler: 0.12.1
- datasets: 2.14.6
- decord: 0.6.0
- deepspeed: 0.14.0
- dill: 0.3.7
- dnspython: 2.6.1
- docker-pycreds: 0.4.0
- docstring-parser: 0.16
- docutils: 0.16
- einops: 0.7.0
- email-validator: 2.2.0
- et-xmlfile: 1.1.0
- exceptiongroup: 1.2.2
- faiss-gpu: 1.7.2
- fastapi: 0.111.1
- fastapi-cli: 0.0.4
- ffmpy: 0.3.2
- filelock: 3.13.1
- fire: 0.6.0
- flash-attn: 2.5.7
- flask: 3.0.3
- fonttools: 4.51.0
- frozenlist: 1.4.1
- fsspec: 2023.10.0
- gekko: 1.2.1
- gitdb: 4.0.11
- gitpython: 3.1.43
- gmpy2: 2.1.2
- gpustat: 1.1.1
- gradio: 4.39.0
- gradio-client: 1.1.1
- grpcio: 1.62.1
- h11: 0.14.0
- hide-warnings: 0.17
- hjson: 3.1.0
- httpcore: 1.0.5
- httptools: 0.6.1
- httpx: 0.27.0
- huggingface-hub: 0.23.4
- humanfriendly: 10.0
- idna: 3.4
- importlib-resources: 6.4.0
- influxdb: 5.3.2
- itsdangerous: 2.1.2
- jinja2: 3.1.3
- jmespath: 1.0.1
- joblib: 1.4.0
- jsonargparse: 4.27.7
- kafka-python: 2.0.2
- kiwisolver: 1.4.5
- lightning: 2.3.0
- lightning-utilities: 0.11.1
- litdata: 0.2.29
- llava: 1.7.0.dev0
- llmtuner: 0.6.3.dev0
- m3u8: 4.0.0
- markdown: 3.6
- markdown-it-py: 3.0.0
- markupsafe: 2.1.3
- matplotlib: 3.8.4
- mdurl: 0.1.2
- media-metric: 0.2.0.10
- mkl-fft: 1.3.8
- mkl-random: 1.2.4
- mkl-service: 2.4.0
- mmidls: 2.0.3
- mpmath: 1.3.0
- msgpack: 1.1.0
- multidict: 6.0.5
- multiprocess: 0.70.15
- networkx: 3.1
- ninja: 1.11.1.1
- nssdk: 0.0.1
- numpy: 1.26.4
- nvidia-ml-py: 12.535.133
- onnx: 1.16.0
- onnxconverter-common: 1.14.0
- opencv-python-headless: 4.9.0.80
- openpyxl: 3.1.5
- optimum: 1.21.1
- orjson: 3.10.6
- packaging: 24.0
- pandas: 2.2.1
- peft: 0.11.1
- pillow: 10.2.0
- pip: 23.3.1
- platformdirs: 4.2.2
- ply: 3.11
- prettytable: 3.10.0
- protobuf: 3.20.2
- psutil: 5.9.8
- py: 1.11.0
- py-cpuinfo: 9.0.0
- pyarrow: 15.0.2
- pyarrow-hotfix: 0.6
- pyasn1: 0.5.1
- pycryptodome: 3.20.0
- pydantic: 2.7.1
- pydantic-core: 2.18.2
- pydub: 0.25.1
- pygments: 2.18.0
- pynvml: 11.5.0
- pyparsing: 3.1.2
- pyrootutils: 1.0.4
- pysocks: 1.7.1
- python-dateutil: 2.9.0.post0
- python-dotenv: 1.0.1
- python-multipart: 0.0.9
- pytorch-lightning: 2.2.1
- pytz: 2024.1
- pyyaml: 6.0.1
- redis: 5.0.3
- regex: 2023.12.25
- requests: 2.31.0
- rich: 13.7.1
- rocketmq-client-python: 2.0.0
- rouge: 1.0.1
- rsa: 4.7.2
- ruff: 0.5.4
- s3transfer: 0.10.1
- safetensors: 0.4.2
- scikit-learn: 1.4.2
- scipy: 1.13.0
- seaborn: 0.13.2
- semantic-version: 2.10.0
- sentencepiece: 0.2.0
- sentry-sdk: 2.5.1
- setproctitle: 1.3.3
- setuptools: 68.2.2
- shellingham: 1.5.4
- shtab: 1.7.1
- six: 1.16.0
- smmap: 5.0.1
- sniffio: 1.3.1
- sse-starlette: 2.1.2
- starlette: 0.37.2
- sympy: 1.12
- tabulate: 0.9.0
- taxonomy: 0.10.0
- tensorboard: 2.16.2
- tensorboard-data-server: 0.7.2
- termcolor: 2.4.0
- threadpoolctl: 3.4.0
- thrift: 0.20.0
- thriftpy2: 0.4.20
- tiktoken: 0.7.0
- timm: 1.0.3
- tokenizers: 0.19.1
- tomlkit: 0.12.0
- torch: 2.2.1
- torchaudio: 2.2.1
- torchmetrics: 1.3.2
- torchvision: 0.17.1
- tqdm: 4.66.2
- transformers: 4.42.4
- transformers-stream-generator: 0.0.5
- triton: 2.2.0
- trl: 0.9.6
- typer: 0.12.3
- typeshed-client: 2.5.1
- typing-extensions: 4.9.0
- tyro: 0.8.5
- tzdata: 2024.1
- urllib3: 2.1.0
- uvicorn: 0.30.3
- uvloop: 0.19.0
- videollama2: 1.0
- wandb: 0.17.1
- watchfiles: 0.22.0
- wcwidth: 0.2.13
- webdataset: 0.2.93
- websockets: 11.0.3
- werkzeug: 3.0.1
- wheel: 0.41.2
- xlrd: 2.0.1
- xmltodict: 0.13.0
- xxhash: 3.4.1
- yarl: 1.9.4
System:
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.10.13
- release: 5.15.0-56-generic
- version: #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022


github-actions · 2024-11-01T03:48:19Z

Hi! thanks for your contribution!, great first issue!

rxqy added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Nov 1, 2024
