DatasetBuilder._split_generators incomplete type annotation #6798
Closed
Description
Describe the bug
The DatasetBuilder._split_generators function currently has the following signature:
class DatasetBuilder:
    def _split_generators(self, dl_manager: DownloadManager):
        ...
However, the dl_manager argument can also be of type StreamingDownloadManager, which behaves differently. For example, its download function doesn't download anything, but rather just returns the given URL(s).
I suggest changing the function signature to:
class DatasetBuilder:
    def _split_generators(self, dl_manager: Union[DownloadManager, StreamingDownloadManager]):
        ...
and also adjusting the docstring accordingly.
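To illustrate the proposed annotation without depending on the library, here is a minimal sketch. The two manager classes below are simplified stand-ins written for this example, not the real datasets classes: their download bodies, the AnyDownloadManager alias, the example URL, and the List[str] return type are all assumptions. The streaming stand-in mirrors the behavior described above (it returns the URL unchanged).

```python
from typing import List, Union, get_type_hints


class DownloadManager:
    """Simplified stand-in for the library class (hypothetical body)."""

    def download(self, url: str) -> str:
        # Placeholder: pretend the file was fetched and return a fake local path.
        return f"/cache/{url.rsplit('/', 1)[-1]}"


class StreamingDownloadManager:
    """Simplified stand-in for the library class (hypothetical body)."""

    def download(self, url: str) -> str:
        # As described in this issue: nothing is downloaded, the URL is returned.
        return url


# Hypothetical alias so the union only has to be spelled out once.
AnyDownloadManager = Union[DownloadManager, StreamingDownloadManager]


class DatasetBuilder:
    def _split_generators(self, dl_manager: AnyDownloadManager) -> List[str]:
        # Only rely on methods that both manager types provide.
        return [dl_manager.download("https://example.com/data.csv")]


if __name__ == "__main__":
    builder = DatasetBuilder()
    print(builder._split_generators(DownloadManager()))
    print(builder._split_generators(StreamingDownloadManager()))
    print(get_type_hints(DatasetBuilder._split_generators)["dl_manager"])
```

With the union annotation, a type checker should accept either manager at the call site and flag use of any attribute that only one of the two classes defines.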
I would like to create a Pull Request to fix this, and have the following questions:
- Are there other options besides DownloadManager and StreamingDownloadManager?
- Should this also be changed in other functions?
Steps to reproduce the bug
Minimal example to print the different class names:
import tempfile
from datasets import load_dataset

example = b'''
from datasets import GeneratorBasedBuilder, DatasetInfo, Features, Value, SplitGenerator

class Test(GeneratorBasedBuilder):
    def _info(self):
        return DatasetInfo(features=Features({"x": Value("int64")}))

    def _split_generators(self, dl_manager):
        print(type(dl_manager))
        return [SplitGenerator('test')]

    def _generate_examples(self):
        yield 0, {'x': 42}
'''

with tempfile.NamedTemporaryFile(suffix='.py') as f:
    f.write(example)
    f.flush()
    load_dataset(f.name, streaming=False)
    load_dataset(f.name, streaming=True)
Expected behavior
Complete type annotations.
Environment info
/