DatasetBuilder._split_generators incomplete type annotation #6798

Closed
@JonasLoos

Description

Describe the bug

The DatasetBuilder._split_generators function currently has the following signature:

class DatasetBuilder:
    def _split_generators(self, dl_manager: DownloadManager):
        ...

However, the dl_manager argument can also be of type StreamingDownloadManager, which behaves differently. For example, its download function doesn't download anything; it just returns the given URL(s).

I suggest changing the function signature to:

class DatasetBuilder:
    def _split_generators(self, dl_manager: Union[DownloadManager, StreamingDownloadManager]):
        ...

and also adjusting the docstring accordingly.
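
To illustrate why the wider annotation matters, here is a minimal, self-contained sketch. It does not import the datasets library: the two manager classes below are hypothetical stand-ins that only mimic the behavior described above (one returns a local path, the other returns the URL unchanged). The AnyDownloadManager alias, the example URL, the cache path, and the list-of-strings return value are likewise assumptions made for illustration; the real method returns SplitGenerator objects.

```python
from typing import List, Union, get_type_hints


class DownloadManager:
    """Hypothetical stand-in: pretends to fetch a URL and return a local path."""

    def download(self, url: str) -> str:
        # Assumed cache layout, for illustration only.
        return "/tmp/cache/" + url.rsplit("/", 1)[-1]


class StreamingDownloadManager:
    """Hypothetical stand-in: returns the URL unchanged, as the issue describes."""

    def download(self, url: str) -> str:
        return url


# Hypothetical alias covering both manager types.
AnyDownloadManager = Union[DownloadManager, StreamingDownloadManager]


class DatasetBuilder:
    def _split_generators(self, dl_manager: AnyDownloadManager) -> List[str]:
        """Resolve the data location with whichever manager was passed in.

        Args:
            dl_manager: Either a DownloadManager (returns a local path) or a
                StreamingDownloadManager (returns the URL itself).
        """
        return [dl_manager.download("https://example.com/data.csv")]


builder = DatasetBuilder()
print(builder._split_generators(DownloadManager()))
print(builder._split_generators(StreamingDownloadManager()))
print(get_type_hints(DatasetBuilder._split_generators)["dl_manager"])
```

With the Union annotation, a type checker accepts both calls, and code inside _split_generators that assumes a local file path can be flagged or guarded with an isinstance check.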

I would like to create a Pull Request to fix this, and have the following questions:

  • Are there other possible types besides DownloadManager and StreamingDownloadManager?
  • Should this also be changed in other functions?

Steps to reproduce the bug

Minimal example to print the different class names:

import tempfile
from datasets import load_dataset

example = b'''
from datasets import GeneratorBasedBuilder, DatasetInfo, Features, Value, SplitGenerator

class Test(GeneratorBasedBuilder):
    def _info(self):
        return DatasetInfo(features=Features({"x": Value("int64")}))
    def _split_generators(self, dl_manager):
        print(type(dl_manager))
        return [SplitGenerator('test')]
    def _generate_examples(self):
        yield 0, {'x': 42}
'''

with tempfile.NamedTemporaryFile(suffix='.py') as f:
    f.write(example)
    f.flush()
    load_dataset(f.name, streaming=False)
    load_dataset(f.name, streaming=True)

Expected behavior

Complete type annotations: the annotation for dl_manager should cover every type that can actually be passed in.

Environment info

/
