Labels: bug, help wanted, question, repro needed
Bug description
I use webdataset with PyTorch Lightning. Specifically, I build a webdataset dataloader (an IterableDataset) and pass it to the PyTorch Lightning Trainer. This works fine in single-node multi-GPU mode, but when I switch to multi-node mode the checkpoints are no longer saved. Can anyone help? Thanks very much!
My ModelCheckpoint config is as follows:
default_modelckpt_cfg = {
    "metrics_over_trainsteps_checkpoint": {
        "target": "pytorch_lightning.callbacks.ModelCheckpoint",
        "params": {
            "dirpath": ckptdir,
            "filename": "{step:09}",
            "every_n_train_steps": 50000,
            "save_top_k": -1,
        },
    },
}
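For reference, here is a minimal sketch of what that config boils down to when the callback is constructed directly (ckptdir is defined elsewhere in my script; the placeholder value below is just for illustration):

from pytorch_lightning.callbacks import ModelCheckpoint

ckptdir = "checkpoints"  # placeholder; in my script this comes from the run config
checkpoint_callback = ModelCheckpoint(
    dirpath=ckptdir,
    filename="{step:09}",
    every_n_train_steps=50000,
    save_top_k=-1,  # keep every checkpoint rather than only the best k
)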
I am testing on 2 nodes with 2 GPUs per node, configured like this:
trainer_kwargs["max_epochs"] = 2
trainer_kwargs["accelerator"] = 'gpu'
trainer_kwargs["devices"] = 2
trainer_kwargs["strategy"] = "ddp"
Then I fit the model with the webdataset dataloader:
trainer.fit(model, train_dataloaders=data.data['train'].dataloader)
I am confused about why this stops working only when I switch to multi-node mode. How does PyTorch Lightning decide when to save a checkpoint?
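Since the issue is labelled "repro needed", here is a minimal self-contained sketch of what I am doing. ToyIterable and ToyModel are stand-ins for my webdataset pipeline and model, and the checkpoint interval is shrunk so a short run actually reaches a save step:

import torch
from torch.utils.data import DataLoader, IterableDataset
import pytorch_lightning as pl

class ToyIterable(IterableDataset):
    # Stand-in for the webdataset loader: an unsized stream of (input, target) pairs.
    def __iter__(self):
        for _ in range(1000):
            yield torch.randn(8), torch.randn(1)

class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

if __name__ == "__main__":
    ckpt_cb = pl.callbacks.ModelCheckpoint(
        dirpath="checkpoints",
        filename="{step:09}",
        every_n_train_steps=100,  # shrunk from 50000 so a toy run hits the save step
        save_top_k=-1,
    )
    trainer = pl.Trainer(
        max_epochs=2,
        accelerator="gpu",
        devices=2,
        num_nodes=2,  # launched on two nodes
        strategy="ddp",
        callbacks=[ckpt_cb],
    )
    trainer.fit(ToyModel(), train_dataloaders=DataLoader(ToyIterable(), batch_size=4))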