DataLoader ignores context manager for Data Sources #936

@jimtjames

Grain's documentation notes that DataLoader will try using a data source as a context manager:

Open file handles should be closed after use. Data sources typically open underlying files in order to read records from them. We recommend implementing data sources as context managers that close their open file handles within the __exit__ method. When opening a data source, the DataLoader will first attempt to use the data source as a context manager. If the data source doesn’t implement the context manager protocol, it will be used as-is, without a with statement.
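For reference, the behavior described in that passage could look roughly like the following. This is a minimal sketch that does not use Grain and is not Grain's actual implementation; `open_data_source`, `read_all`, `PlainSource`, and `ManagedSource` are hypothetical names introduced only for illustration.

```python
import contextlib


def open_data_source(source):
    """Return something usable in a `with` statement for any data source.

    Hypothetical helper: if the source implements the context manager
    protocol it is returned unchanged, otherwise it is wrapped in a
    no-op context so it is used as-is.
    """
    source_type = type(source)
    if hasattr(source_type, "__enter__") and hasattr(source_type, "__exit__"):
        return source
    return contextlib.nullcontext(source)


class PlainSource:
    """Data source without context manager support."""

    def __len__(self):
        return 3

    def __getitem__(self, idx):
        return idx


class ManagedSource(PlainSource):
    """Data source that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.events.append("exit")
        return False


def read_all(source):
    """Read every record, entering the source first when it supports it."""
    with open_data_source(source) as opened:
        return [opened[i] for i in range(len(opened))]
```

Under this sketch, `read_all(ManagedSource())` calls `__enter__` before the first read and `__exit__` after the last one, while `read_all(PlainSource())` reads the records without a context at all.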

However, from my testing, it seems that DataLoader doesn't do this. See the following code snippet:

import grain.python as pygrain
from grain.sources import RandomAccessDataSource


class ContextDataset(RandomAccessDataSource):
    data: list

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

    def __enter__(self):
        print('entering')
        self.data = [1,2,3,4,5,6]
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        print('exiting')
        return False


if __name__ == '__main__':
    index_sampler = pygrain.IndexSampler(
        num_records=6,
        num_epochs=1,
        shard_options=pygrain.NoSharding(),
        shuffle=False,
        seed=0)
    transformations = [pygrain.Batch(batch_size=1, drop_remainder=True)]
    data_source = ContextDataset()
    dataloader = pygrain.DataLoader(
        data_source=data_source,
        operations=transformations,
        sampler=index_sampler,
        worker_count=0)

    for data in dataloader:  # raises AttributeError: __enter__ is never called, so self.data is unset
        print(data)

In this example, if the data source were used as a context manager, __enter__ would populate data. Instead, iterating over the dataloader raises AttributeError: 'ContextDataset' object has no attribute 'data' from __getitem__, and 'entering' is never printed.
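The expectation itself can be checked without Grain. The sketch below uses a stand-in `ContextDataset` that drops the `RandomAccessDataSource` base class (an assumption made only to keep the example self-contained); `read_without_entering` and `read_after_entering` are hypothetical helper names.

```python
class ContextDataset:
    """Stand-in for the class in the reproduction, without the Grain base."""

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

    def __enter__(self):
        # `data` only exists once the context has been entered.
        self.data = [1, 2, 3, 4, 5, 6]
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        return False


def read_without_entering():
    """Index the source directly and return the resulting error message."""
    source = ContextDataset()
    try:
        return source[0]
    except AttributeError as err:
        return str(err)


def read_after_entering():
    """Enter the source with a `with` statement, then read every record."""
    with ContextDataset() as source:
        return [source[i] for i in range(len(source))]
```

Here `read_after_entering()` returns all six records, while `read_without_entering()` fails in the same way as the DataLoader loop in the reproduction, which is consistent with `__enter__` never being called.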

I tested with the latest stable version of grain (0.2.11) on Python 3.12.7.

Labels: type:bug
