🐛 Bug
Hi, we are using Lightning with litdata on our local machine and AWS S3. However, training hangs randomly during the very first iterations when running DDP against a remote cloud directory.
I tried several different configurations, but I'm not sure what I should check next.
GPUs / Strategy / Data location / Result
1 / no DDP / local SSD / OK
1 / no DDP / remote (S3) / OK
8 / DDP / local SSD / OK
8 / DDP / remote (S3) / stuck
To Reproduce
I am following the exact steps of the ImageNet demo, and I wrote a small trainer myself (see the code sample below).
Running python train.py with different CUDA_VISIBLE_DEVICES settings is enough to reproduce the problem.
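For example (assuming a single node with 8 GPUs; the exact device indices don't matter):

# single GPU, no DDP -- works
CUDA_VISIBLE_DEVICES=0 python train.py

# 8 GPUs, DDP -- gets stuck within the first iterations
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py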
Code sample
# train.py
import numpy as np
import lightning as L
import torch, torch.nn as nn, torch.utils.data as data, torchvision as tv, torch.nn.functional as F
from lightning.data import StreamingDataset, StreamingDataLoader


class LitAutoEncoder(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(nn.Linear(32, 128))

    def training_step(self, batch, batch_idx):
        loss = self.decoder(batch).mean()
        print(self.trainer.local_rank, loss)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


class ImageNetStreaming(StreamingDataset):
    def __init__(self):
        if 1:  # remote S3 input directory (the failing case, bucket name redacted)
            input_dir = "s3://xxxxx"
        else:  # local copy of the validation set (works)
            input_dir = "val"
        max_cache_size = "200GB"
        super().__init__(
            input_dir=input_dir,
            max_cache_size=max_cache_size,
            shuffle=True,
        )

    def __getitem__(self, idx):
        # read the sample to exercise the streaming pipeline,
        # but return a constant so the model side stays trivial
        data = super().__getitem__(idx)
        return np.float32(123.0)


dataset = ImageNetStreaming()
dataloader = StreamingDataLoader(
    dataset,
    batch_size=32,
    num_workers=2,
    pin_memory=True,
    shuffle=True,
    drop_last=True,
)

autoencoder = LitAutoEncoder()
trainer = L.Trainer()
trainer.fit(autoencoder, dataloader)
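Note that trainer = L.Trainer() above relies on the defaults: with eight GPUs visible it picks them all up and, as far as I understand, selects DDP automatically. The failing 8-GPU runs should be equivalent to pinning this explicitly (a sketch, not something we actually changed):

trainer = L.Trainer(accelerator="gpu", devices=8, strategy="ddp")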
Expected behavior
Training should run to completion instead of hanging during the first iterations.
Additional context
Due to regulations here, we cannot put our data or training scripts on Lightning Studio. I'm not sure if something is wrong with our S3 bucket or our network configuration.
One thing I noticed: even when training gets stuck at an early iteration (<50), we still observe high network throughput on the machine (around 100 mb/s), but the local chunk directory (~/.lightning/chunks) stops growing.
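For reference, this is roughly how we watched the cache directory while training was running (a minimal sketch; the script name watch_cache.py and the 10-second interval are just for illustration, ~/.lightning/chunks is the cache path mentioned above):

# watch_cache.py -- print the size of the litdata chunk cache every 10 s
import os
import time

def dir_size_bytes(path: str) -> int:
    total = 0
    for root, _, files in os.walk(os.path.expanduser(path)):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a chunk may be evicted between listing and stat
    return total

if __name__ == "__main__":
    while True:
        print(f"{dir_size_bytes('~/.lightning/chunks') / 1e6:.1f} MB")
        time.sleep(10)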