Checkpoints are not saved in multi-node training mode when using WebDataset #16893

@superhero-7

Bug description

I use WebDataset with PyTorch Lightning. Specifically, I build a WebDataset dataloader, which wraps an IterableDataset, and pass it to the PyTorch Lightning trainer. This works fine in single-node multi-GPU mode, but when I switch to multi-node mode, the checkpoints are not saved. Can anyone help? Thanks very much!

My ModelCheckpoint config is as follows:

    default_modelckpt_cfg = {
        "metrics_over_trainsteps_checkpoint": {
            "target": "pytorch_lightning.callbacks.ModelCheckpoint",
            "params": {
                "dirpath": ckptdir,
                "filename": "{step:09}",
                "every_n_train_steps": 50000,
                "save_top_k": -1,
            },
        },
    }
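
The `target`/`params` layout suggests that the callback is built by a small helper that imports the class from its dotted path and calls it with the given parameters. The helper is not shown in this report, so the name `instantiate_from_config` below is hypothetical, and the sketch uses a standard-library class as a stand-in target so that it runs without Lightning installed.

```python
import importlib


def instantiate_from_config(cfg):
    """Import the dotted path in cfg["target"] and call it with cfg["params"].

    Hypothetical helper: the report does not show how the config is consumed.
    """
    module_name, _, attr = cfg["target"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), attr)
    return cls(**cfg.get("params", {}))


# Stand-in target from the standard library (not the real callback).
demo_cfg = {"target": "datetime.timedelta", "params": {"seconds": 90}}
delta = instantiate_from_config(demo_cfg)
```

With the real config, the same call would be applied to `default_modelckpt_cfg["metrics_over_trainsteps_checkpoint"]`, and the resulting object would be passed to the trainer in its `callbacks` list.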

I tested on 2 nodes with 2 GPUs per node, configured like this:

    trainer_kwargs["max_epochs"] = 2
    trainer_kwargs["accelerator"] = "gpu"
    trainer_kwargs["devices"] = 2
    trainer_kwargs["strategy"] = "ddp"
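
The snippet above does not show a `num_nodes` entry. The Lightning `Trainer` accepts a `num_nodes` argument, and `devices` counts GPUs per node, so a 2-node, 2-GPU-per-node run would usually look like the sketch below. Whether `num_nodes` is set elsewhere in the real script is not visible here, so that line is an assumption.

```python
# Hypothetical completion of the trainer arguments for 2 nodes x 2 GPUs.
trainer_kwargs = {}
trainer_kwargs["max_epochs"] = 2
trainer_kwargs["accelerator"] = "gpu"
trainer_kwargs["devices"] = 2      # GPUs per node
trainer_kwargs["num_nodes"] = 2    # assumption: not shown in the snippet above
trainer_kwargs["strategy"] = "ddp"

# Total number of DDP processes expected across the whole job.
world_size = trainer_kwargs["devices"] * trainer_kwargs["num_nodes"]
```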

Then I pass the WebDataset dataloader to the trainer:

    trainer.fit(model, train_dataloaders=data.data['train'].dataloader)

I am confused about why this stops working when I switch to multi-node mode. How does PyTorch Lightning decide when to save a checkpoint?
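
As far as I understand, with `every_n_train_steps` set, `ModelCheckpoint` only triggers when the trainer's global step is a multiple of that value, and under DDP the file is written by global rank 0 only. One way to reason about the symptom is to count how many optimizer steps a run actually reaches. The sketch below is plain Python, not Lightning code. It assumes the samples are split evenly across ranks, and the dataset size and batch size are hypothetical numbers, not values from my setup.

```python
def checkpoint_steps(total_samples, batch_size, world_size,
                     every_n_train_steps, max_epochs=1):
    """Return the global steps at which a step-based checkpoint would trigger.

    Assumes each rank sees total_samples / world_size samples per epoch
    (an assumption about how the shards are split, not a Lightning guarantee).
    """
    steps_per_epoch = total_samples // (batch_size * world_size)
    total_steps = steps_per_epoch * max_epochs
    return list(range(every_n_train_steps, total_steps + 1, every_n_train_steps))


# Hypothetical numbers: 3M samples, batch size 32 per GPU, 2 epochs.
single_node = checkpoint_steps(3_000_000, 32, world_size=2,
                               every_n_train_steps=50_000, max_epochs=2)
multi_node = checkpoint_steps(3_000_000, 32, world_size=4,
                              every_n_train_steps=50_000, max_epochs=2)
```

Under these assumed numbers the 2-GPU run crosses step 50000 once, while the 4-GPU run ends before reaching it, so no checkpoint would ever be triggered. If that is what happens, a smaller `every_n_train_steps` should make checkpoints appear again, and they would show up only on the filesystem of the node that hosts rank 0.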

Labels: bug (Something isn't working), help wanted (Open to be worked on), question (Further information is requested), repro needed (The issue is missing a reproducible example)
