Description
🚀 The feature
Brief description
The random generator in newly created StatefulDataLoaders should be randomly initialized - but it is currently deterministic. This would make different DataLoaders generate different shuffles of the data, more in line with how 'normal' DataLoaders behave. Such a change is, to my understanding, not in conflict with the statefulness of the dataloaders: the generator can still be stored and loaded, there is just no reason for it to start the same way each time.
Current state
At the moment, all StatefulDataLoaders with shuffling on have the same initial internal RNG state, regardless of the environment's RNG state at the time each dataloader is created. For example:
import random

import numpy as np
import pytorch_lightning as pl
import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def get_dl(d):
    return DataLoader(d, batch_size=1, shuffle=True)

def get_generator(dl):
    # the RNG state tensor that controls the shuffle order
    return dl.state_dict()["_index_sampler_state"]["sampler_iter_state"]["generator"]

def same_order(dl1, dl2):
    order1 = [b.item() for b in dl1]
    order2 = [b.item() for b in dl2]
    assert len(order1) > 0  # not accidentally comparing empty loaders
    return order1 == order2

dl1, dl2 = get_dl(list(range(10))), get_dl(list(range(100)))
print("different dataloaders (on different dataset!) start from same RNG?:", False not in (get_generator(dl1) == get_generator(dl2)))

dl1, dl2 = get_dl(list(range(10))), get_dl(list(range(10)))
print("new dataloaders on same dataset create same order?: ", same_order(dl1, dl2))

print("trying again, now forcing the environment random state to be sure")

def seed_all(seed):
    for f in [pl.seed_everything, random.seed, torch.mps.manual_seed,
              torch.manual_seed, np.random.seed]:
        f(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_all(0)
dl1 = get_dl(list(range(10)))
seed_all(1)
dl2 = get_dl(list(range(10)))
print("different dataloaders (started with different random environment) start from same RNG?:", False not in (get_generator(dl1) == get_generator(dl2)))
print("new dataloaders (started with different random environment) on same dataset create same order?: ", same_order(dl1, dl2))
Yields:
different dataloaders (on different dataset!) start from same RNG?: True
new dataloaders on same dataset create same order?: True
trying again, now forcing the environment random state to be sure
different dataloaders (started with different random environment) start from same RNG?: True
new dataloaders (started with different random environment) on same dataset create same order?: True
This behaviour does not seem necessary for the "statefulness" of the dataloaders, as their state_dict contains a tensor controlling the shuffles, so whichever random state they currently have can be saved and loaded as needed: new ones don't all need to start from the same random state.
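A minimal sketch of the save/resume behaviour I am relying on here (my own example, assuming the usual mid-epoch checkpointing usage of StatefulDataLoader):

from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

dl = DataLoader(list(range(10)), batch_size=1, shuffle=True)
it = iter(dl)
first_half = [next(it).item() for _ in range(5)]
sd = dl.state_dict()  # checkpoint mid-epoch, including the shuffle RNG state

dl_resumed = DataLoader(list(range(10)), batch_size=1, shuffle=True)
dl_resumed.load_state_dict(sd)  # restore the saved shuffle state
second_half = [b.item() for b in dl_resumed]

# together the two halves cover the epoch exactly once, whichever
# initial RNG state the first loader happened to have
assert sorted(first_half + second_half) == list(range(10))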
Request
I normally expect new shuffling dataloaders to create different shuffles from each other, and in particular to be sensitive to the environment's random state at initialization (and I assume I get this kind of variety when testing runs with different random seeds).
I think this can be remedied by setting the generator tensor (dl.state_dict()["_index_sampler_state"]["sampler_iter_state"]["generator"]) randomly on initialization. I would use a workaround, something like the following, if only I understood what the generator tensor is composed of and how to make a legal one:
# proposal that doesn't work, because I don't understand what the generator
# state is composed of and thus don't know how to build a legal one
import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def get_generator(sd):
    return sd["_index_sampler_state"]["sampler_iter_state"]["generator"]

def set_generator(sd, t):
    sd["_index_sampler_state"]["sampler_iter_state"]["generator"] = t

def shuffling_dl_getter(d, batch_size):
    dl = DataLoader(d, batch_size=batch_size, shuffle=True)
    sd = dl.state_dict()
    g = get_generator(sd)
    # random bytes in the observed range - unfortunately not a valid generator state
    random_initial_generator = torch.randint(int(g.min()), int(g.max()) + 1, g.shape).byte()
    set_generator(sd, random_initial_generator)
    dl.load_state_dict(sd)
    return dl

d = list(range(10))
dl = shuffling_dl_getter(d, 1)  # succeeds, but then...
sorted([b.item() for b in dl]) == d  # fails when iterating, complaining of an invalid mt19937 state
I would appreciate a "correct" version of the code in shuffling_dl_getter above being added to the initialization of StatefulDataLoader! Unfortunately, I don't understand the composition of the generator tensor, so I can't build a 'good' one myself. In particular, I notice that g is a tensor of length 5056 with many 0s and many larger values, while mt19937 states should be of length 624, and I don't know what all the extra content is.
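For what it's worth, 5056 seems to match the length of the ByteTensor returned by torch.Generator().get_state() for the default CPU generator (the mt19937 engine state plus some extra bookkeeping). Assuming the stored tensor really is just that get_state() output, an untested sketch of the kind of fix I have in mind (helper names are mine) could be:

import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def set_generator(sd, t):
    sd["_index_sampler_state"]["sampler_iter_state"]["generator"] = t

def shuffling_dl_getter(d, batch_size):
    dl = DataLoader(d, batch_size=batch_size, shuffle=True)
    sd = dl.state_dict()
    g = torch.Generator()
    g.seed()  # nondeterministic seed taken from the environment
    # assumption: the stored tensor has the same layout as Generator.get_state()
    set_generator(sd, g.get_state())
    dl.load_state_dict(sd)
    return dl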
Motivation, pitch
When testing the sensitivity of an architecture or training routine to the random state, I assume that the data order is being changed too (and not just the network's initial weights and the dropouts throughout training).
Alternatives
If I could, I would use the shuffling_dl_getter code described above to obtain randomly initialized StatefulDataLoaders myself; unfortunately, it is not clear to me how to make legal random states for the dataloaders.
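Another possible workaround - untested, and assuming StatefulDataLoader forwards the generator argument to its sampler the way torch.utils.data.DataLoader does - would be to pass an explicitly, nondeterministically seeded generator at construction:

import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

g = torch.Generator()
g.seed()  # nondeterministic seed
dl = DataLoader(list(range(10)), batch_size=1, shuffle=True, generator=g)

Even if this works, though, I would still expect the default (no explicit generator) to be randomly initialized.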
Additional context
No response