
StatefulDataloader shuffling ignores system RNGs at initiation, in fact internal random state is identical for all (?) new dataloaders #1440

Closed
@gailweiss

Description


🚀 The feature

Brief description

The random generator in a newly created StatefulDataLoader should be randomly initialized, but it is currently deterministic. Randomizing it would make different DataLoaders generate different shuffles of the data, more in line with how 'normal' DataLoaders behave. Such a change is, to my understanding, not in conflict with the statefulness of the dataloaders: the generator can still be stored and loaded; there is just no reason for it to start from the same state every time.
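
For comparison, here is a minimal sketch of the behaviour I expect, using the plain torch.utils.data.DataLoader: when no generator is passed it draws its shuffle from the global torch RNG, so two freshly created loaders almost always produce different orders.

import torch
from torch.utils.data import DataLoader as PlainDataLoader

data = list(range(10))

# two fresh 'normal' dataloaders: each draws its shuffle seed from the
# global torch RNG, so the two orders will (almost always) differ
dl_a = PlainDataLoader(data, batch_size=1, shuffle=True)
dl_b = PlainDataLoader(data, batch_size=1, shuffle=True)
print([b.item() for b in dl_a])
print([b.item() for b in dl_b])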

Current state

At the moment, all StatefulDataLoaders with shuffling enabled start from the same internal RNG state, regardless of the environment's RNG state at the time they are created. For example:

import random

import numpy as np
import torch
import pytorch_lightning as pl  # for pl.seed_everything below; drop if Lightning isn't installed

from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def get_dl(d):
    return DataLoader(d, batch_size=1, shuffle=True)

def get_generator(dl):
    return dl.state_dict()["_index_sampler_state"]["sampler_iter_state"]["generator"]

def same_order(dl1, dl2):
    order1 = [b.item() for b in dl1]
    order2 = [b.item() for b in dl2]
    assert len(order1)>0  # not accidentally on an empty one
    return order1 == order2

dl1, dl2 = get_dl(list(range(10))), get_dl(list(range(100)))
print("different dataloaders (on different dataset!) start from same RNG?:", False not in (get_generator(dl1) == get_generator(dl2)))

dl1, dl2 = get_dl(list(range(10))), get_dl(list(range(10)))
print("new dataloaders on same dataset create same order?: ", same_order(dl1, dl2))

print("trying again, now forcing the environment random state to be sure")

def seed_all(seed):
    # torch.mps.manual_seed and pl.seed_everything can be dropped on machines
    # without MPS / Lightning
    for f in [pl.seed_everything, random.seed, torch.mps.manual_seed,
              torch.manual_seed, np.random.seed]:
        f(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_all(0)
dl1 = get_dl(list(range(10)))
seed_all(1)
dl2 = get_dl(list(range(10)))

print("different dataloaders (started with different random environment) start from same RNG?:", False not in (get_generator(dl1) == get_generator(dl2)))
print("new dataloaders (started with different random environment) on same dataset create same order?: ", same_order(dl1, dl2))

Yields:

different dataloaders (on different dataset!) start from same RNG?: True
new dataloaders on same dataset create same order?:  True
trying again, now forcing the environment random state to be sure
different dataloaders (started with different random environment) start from same RNG?: True
new dataloaders (started with different random environment) on same dataset create same order?:  True

This behaviour does not seem necessary for the "statefulness" of the dataloaders: their state_dict contains a tensor controlling the shuffles, so whatever random state they currently have can be saved and loaded as needed. New dataloaders don't all need to start from the same random state.
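
For example (a minimal sketch using only the state_dict / load_state_dict API): a checkpoint taken mid-epoch from one shuffling dataloader can be loaded into a brand-new one, which then continues from the same point. I would expect this to keep working however the fresh generator is initialized.

from torchdata.stateful_dataloader import StatefulDataLoader

data = list(range(10))

dl1 = StatefulDataLoader(data, batch_size=1, shuffle=True)
it1 = iter(dl1)
seen = [next(it1).item() for _ in range(3)]  # consume a few batches
sd = dl1.state_dict()                        # checkpoint mid-epoch

dl2 = StatefulDataLoader(data, batch_size=1, shuffle=True)
dl2.load_state_dict(sd)                      # resume in a fresh loader

rest1 = [b.item() for b in it1]
rest2 = [b.item() for b in dl2]
print(rest1 == rest2)  # expected: True, dl2 picks up where dl1 left off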

Request

I normally expect new shuffling dataloaders to produce different shuffles from each other, and in particular to be sensitive to the environment's random state at initiation (and I assume I get this kind of variety when testing runs on different random seeds).

I think this can be remedied by setting the generator tensor ( dl.state_dict()["_index_sampler_state"]["sampler_iter_state"]["generator"] ) randomly at initiation. I would use a workaround along the following lines myself, if only I understood what the generator tensor is composed of and how to build a legal one:

# proposal that doesn't work because I don't understand what the generator
# is composed of, and thus do not know how to make a legal one
import torch
from torchdata.stateful_dataloader import StatefulDataLoader as DataLoader

def get_generator(sd):
    return sd["_index_sampler_state"]["sampler_iter_state"]["generator"]

def set_generator(sd, t):
    sd["_index_sampler_state"]["sampler_iter_state"]["generator"] = t

def shuffling_dl_getter(d, batch_size):
    dl = DataLoader(d, batch_size=batch_size, shuffle=True)
    sd = dl.state_dict()
    g = get_generator(sd)
    # random bytes in the same range and shape as g, but not a valid generator state
    random_initial_generator = torch.randint(int(g.min()), int(g.max()), g.shape).byte()  # unfortunately won't be accepted when it comes to use
    set_generator(sd, random_initial_generator)
    dl.load_state_dict(sd)
    return dl

d = list(range(10))
dl = shuffling_dl_getter(d, 1)  # succeeds, but then
sorted([b.item() for b in dl]) == d  # won't successfully run, complaining of an invalid mt19937 state

I would appreciate a "correct" version of the code in shuffling_dl_getter above being added to the initiation of the StatefulDataLoaders! Unfortunately I don't understand the composition of the generator tensor so I can't build a 'good' one myself. In particular I notice that g is a tensor of length 5056 with many 0s and many higher numbers, while mt19937 states should be of length 624, and I don't know what all the extra stuff is

Motivation, pitch

When testing the sensitivity of an architecture or training routine to random state, I assume that the data order is being changed too (and not just the network's initial weights and the dropout applied throughout training).

Alternatives

If I could, I would use the shuffling_dl_getter code described above to obtain randomly initialized StatefulDataLoaders myself; unfortunately, it is not clear to me how to construct legal random states for the dataloaders.

Additional context

No response
