Bug description
When I use a custom BatchSampler to initialize the DataLoader and use it with PyTorch Lightning's datamodule, the shuffle setting does not take effect. No matter which sampler (random or sequential) I use to initialize the BatchSampler, PyTorch Lightning sets the wrapped DistributedSampler to its default option: shuffle enabled for the training stage, and for the other stages whatever the dataloader's own sampler type implies.
Analysis
The problem arises in the _is_dataloader_shuffled function (in pytorch_lightning.utilities.data), which decides the shuffle state from the dataloader's sampler attribute. That looks harmless, but PyTorch ignores the sampler when a BatchSampler is passed: the dataloader's own sampler is set to the default sequential sampler. PyTorch Lightning therefore always sees a SequentialSampler here, so shuffle does not work as I expected.
I think the PyTorch implementation is also questionable. In the latest version of the PyTorch code, the DataLoader's sampler argument is mutually exclusive with batch_sampler, shuffle, etc. So when I use a custom BatchSampler, PyTorch only initializes a default SequentialSampler for the sampler attribute, which is a bit counter-intuitive. It does not produce wrong results in plain PyTorch, because PyTorch iterates over the batch sampler whenever one exists; the sampler attribute is used directly only when automatic batching is disabled (batch_size=None).
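The mismatch described above can be checked directly in plain PyTorch. The following is a minimal sketch, assuming a toy `TensorDataset`; the variable names are illustrative and not taken from the issue.

```python
import torch
from torch.utils.data import (
    BatchSampler,
    DataLoader,
    RandomSampler,
    SequentialSampler,
    TensorDataset,
)

# Toy dataset, assumed for illustration only.
dataset = TensorDataset(torch.arange(10))

# Shuffling is requested through the sampler wrapped by the BatchSampler.
batch_sampler = BatchSampler(RandomSampler(dataset), batch_size=4, drop_last=False)

# `batch_sampler` is mutually exclusive with `sampler`, `shuffle`,
# `batch_size`, and `drop_last`, so none of those are passed here.
loader = DataLoader(dataset, batch_sampler=batch_sampler)

outer_sampler = loader.sampler                # default created by DataLoader
inner_sampler = loader.batch_sampler.sampler  # the sampler that drives iteration

print(type(outer_sampler).__name__)
print(type(inner_sampler).__name__)
```

Any check that inspects only `loader.sampler` sees the default sampler, not the one that determines the iteration order.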
key code:
Suggestions
Since PyTorch's behavior does not cause this problem when PyTorch Lightning is not used, I suggest changing the PyTorch Lightning code:
before:
    def _is_dataloader_shuffled(dataloader: object) -> bool:
        if hasattr(dataloader, "__pl_saved_kwargs"):
            # this attribute is not part of PyTorch's DataLoader, but could have been set by
            # our `_replace_init_method` context manager
            if "shuffle" in dataloader.__pl_saved_kwargs:
                return dataloader.__pl_saved_kwargs["shuffle"]
            if "shuffle" in dataloader.__pl_saved_arg_names:
                return dataloader.__pl_saved_args[dataloader.__pl_saved_arg_names.index("shuffle")]
        if hasattr(dataloader, "dataset") and isinstance(dataloader.dataset, IterableDataset):
            # shuffling is useless with iterable datasets
            return False
        if not hasattr(dataloader, "sampler"):
            # shuffling is enabled via a sampler. No sampler, no shuffling
            return False
        sampler = dataloader.sampler
        if isinstance(sampler, SequentialSampler):
            return False
        return isinstance(sampler, RandomSampler)
after:
    def _is_dataloader_shuffled(dataloader: object) -> bool:
        if hasattr(dataloader, "__pl_saved_kwargs"):
            # this attribute is not part of PyTorch's DataLoader, but could have been set by
            # our `_replace_init_method` context manager
            if "shuffle" in dataloader.__pl_saved_kwargs:
                return dataloader.__pl_saved_kwargs["shuffle"]
            if "shuffle" in dataloader.__pl_saved_arg_names:
                return dataloader.__pl_saved_args[dataloader.__pl_saved_arg_names.index("shuffle")]
        if hasattr(dataloader, "dataset") and isinstance(dataloader.dataset, IterableDataset):
            # shuffling is useless with iterable datasets
            return False
        if not hasattr(dataloader, "sampler"):
            # shuffling is enabled via a sampler. No sampler, no shuffling
            return False
        batch_sampler = dataloader.batch_sampler
        if batch_sampler is not None:
            # prefer the sampler wrapped by the batch sampler
            sampler = batch_sampler.sampler
        else:
            sampler = dataloader.sampler
        sampler_cls = type(sampler)
        if sampler_cls not in (RandomSampler, SequentialSampler):
            # custom sampler case:
            if hasattr(sampler, "generator"):
                # maybe custom random sampler
                return True
            else:
                # we don't know
                return False
        if isinstance(sampler, SequentialSampler):
            return False
        return isinstance(sampler, RandomSampler)
What version are you seeing the problem on?
master
How to reproduce the bug
First, define a custom BatchSampler (or just use the default BatchSampler).
Second, initialize the DataLoader with that BatchSampler.
If you then use this dataloader (dl) to initialize the datamodule, the bug occurs.
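The snippets for these steps did not survive in the report, so the following is only a hedged sketch of what they might look like. `MyBatchSampler`, `shuffled_via_sampler`, and `shuffled_via_batch_sampler` are hypothetical names introduced here; the two helpers mirror only the final sampler check of the "before" and "after" code and are not Lightning APIs. The datamodule step is left out so the sketch depends on PyTorch alone.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, TensorDataset


class MyBatchSampler(BatchSampler):
    """Hypothetical custom batch sampler; it only inherits the default behaviour."""


def shuffled_via_sampler(dataloader):
    # Mirrors the last step of the "before" code: inspects dataloader.sampler only.
    return isinstance(dataloader.sampler, RandomSampler)


def shuffled_via_batch_sampler(dataloader):
    # Mirrors the idea of the "after" code: prefer the sampler wrapped by the
    # batch sampler, falling back to dataloader.sampler when there is none.
    batch_sampler = getattr(dataloader, "batch_sampler", None)
    sampler = getattr(batch_sampler, "sampler", dataloader.sampler)
    return isinstance(sampler, RandomSampler)


# Step 1: a batch sampler wrapping a RandomSampler (toy dataset assumed).
dataset = TensorDataset(torch.arange(8))
my_batch_sampler = MyBatchSampler(RandomSampler(dataset), batch_size=2, drop_last=False)

# Step 2: initialize the DataLoader with that batch sampler.
dl = DataLoader(dataset, batch_sampler=my_batch_sampler)

print(shuffled_via_sampler(dl))        # check based on dl.sampler
print(shuffled_via_batch_sampler(dl))  # check based on dl.batch_sampler.sampler
```

The `getattr` fallback is a design choice of this sketch: a batch sampler is not required to expose a `sampler` attribute, so reading `batch_sampler.sampler` unconditionally could raise an AttributeError for some custom classes.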